AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns to increase the success ratio.
We have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
1. To predict whether a liability customer will buy a personal loan or not.
2. To identify which variables are most significant.
3. To identify which segment of customers should be targeted more.
This dataset contains information on AllLife Bank's liability customers.
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit from the number of displayed columns and rows.
# This is so I can see the entire dataframe when I print it
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)
# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# To build sklearn model
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import f1_score,accuracy_score, recall_score, precision_score, roc_auc_score, roc_curve, confusion_matrix, precision_recall_curve
# To build Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
#Import the data source csv file as a data frame
data = pd.read_csv('Loan_Modelling.csv')
#Make a copy to avoid any changes to the original data
loan_data = data.copy()
#Print the first five rows of the dataset
print(loan_data.head())
#Print the last five rows of the dataset
print(loan_data.tail())
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage \
0 1 25 1 49 91107 4 1.6 1 0
1 2 45 19 34 90089 3 1.5 1 0
2 3 39 15 11 94720 1 1.0 1 0
3 4 35 9 100 94112 1 2.7 2 0
4 5 35 8 45 91330 4 1.0 2 0
Personal_Loan Securities_Account CD_Account Online CreditCard
0 0 1 0 0 0
1 0 1 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1
ID Age Experience Income ZIPCode Family CCAvg Education \
4995 4996 29 3 40 92697 1 1.9 3
4996 4997 30 4 15 92037 4 0.4 1
4997 4998 63 39 24 93023 2 0.3 3
4998 4999 65 40 49 90034 3 0.5 2
4999 5000 28 4 83 92612 3 0.8 1
Mortgage Personal_Loan Securities_Account CD_Account Online \
4995 0 0 0 0 1
4996 85 0 0 0 1
4997 0 0 0 0 0
4998 0 0 0 0 1
4999 0 0 0 0 1
CreditCard
4995 0
4996 0
4997 0
4998 0
4999 1
#Print the number of rows and columns in dataset
print(loan_data.shape)
(5000, 14)
#Check for null values and duplicates
print(loan_data.isna().sum())
print(loan_data.duplicated().sum())
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
dtype: int64
0
There are no null values in any of the columns.
There are no duplicates in the dataset.
#Check the column datatypes
print(loan_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5000 non-null int64
1 Age 5000 non-null int64
2 Experience 5000 non-null int64
3 Income 5000 non-null int64
4 ZIPCode 5000 non-null int64
5 Family 5000 non-null int64
6 CCAvg 5000 non-null float64
7 Education 5000 non-null int64
8 Mortgage 5000 non-null int64
9 Personal_Loan 5000 non-null int64
10 Securities_Account 5000 non-null int64
11 CD_Account 5000 non-null int64
12 Online 5000 non-null int64
13 CreditCard 5000 non-null int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
None
The ID column carries no statistical information, so we will drop it as part of the clean-up.
The dependent variable is Personal_Loan, which is stored as a numeric data type.
All the variables are of numeric data type. We will convert a few columns to categorical soon.
There are no missing values in the dataset.
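The clean-up step mentioned above can be sketched on a toy frame. The frame `toy` and its values are illustrative assumptions, not the notebook's data; only the `ID` column name comes from the dataset.

```python
import pandas as pd

# Toy frame standing in for loan_data (values are illustrative only)
toy = pd.DataFrame(
    {"ID": [1, 2, 3], "Age": [25, 45, 39], "Personal_Loan": [0, 0, 1]}
)

# An identifier is unique per row, so it carries no predictive signal
assert toy["ID"].is_unique

# Drop the identifier; drop() returns a new frame and leaves toy unchanged
toy_clean = toy.drop(columns=["ID"])
print(toy_clean.columns.tolist())
```

Returning a new frame rather than dropping in place keeps the original columns available if the identifier is needed later for joining results back to customers.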
#Check the summary of dataset
print(loan_data.describe().T)
count mean std min 25% \
ID 5000.0 2500.500000 1443.520003 1.0 1250.75
Age 5000.0 45.338400 11.463166 23.0 35.00
Experience 5000.0 20.104600 11.467954 -3.0 10.00
Income 5000.0 73.774200 46.033729 8.0 39.00
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00
Family 5000.0 2.396400 1.147663 1.0 1.00
CCAvg 5000.0 1.937938 1.747659 0.0 0.70
Education 5000.0 1.881000 0.839869 1.0 1.00
Mortgage 5000.0 56.498800 101.713802 0.0 0.00
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00
CD_Account 5000.0 0.060400 0.238250 0.0 0.00
Online 5000.0 0.596800 0.490589 0.0 0.00
CreditCard 5000.0 0.294000 0.455637 0.0 0.00
50% 75% max
ID 2500.5 3750.25 5000.0
Age 45.0 55.00 67.0
Experience 20.0 30.00 43.0
Income 64.0 98.00 224.0
ZIPCode 93437.0 94608.00 96651.0
Family 2.0 3.00 4.0
CCAvg 1.5 2.50 10.0
Education 2.0 3.00 3.0
Mortgage 0.0 101.00 635.0
Personal_Loan 0.0 0.00 1.0
Securities_Account 0.0 0.00 1.0
CD_Account 0.0 0.00 1.0
Online 1.0 1.00 1.0
CreditCard 0.0 1.00 1.0
ID: This is just a customer ID number and adds no statistical value to our analysis.
Age: The average age of customers in the dataset is about 45 years, with a wide range from 23 to 67 years.
Experience: The average experience of customers is about 20 years. There are negative values (minimum -3), which need to be treated.
Income: The average income of the customers is around 73K USD. There is a gap between the minimum (8K) and the 25th percentile (39K) and a much larger gap between the 75th percentile (98K) and the maximum value (224K), which indicates that there might be outliers in the variable.
ZIPCode: This field will not be used directly for analysis and will be mapped to corresponding cities for further analysis.
Family: The average family size of the customers in the dataset is about 2.4, with a median of 2.
CCAvg: On average customers spend around 1,900 dollars a month. The large gap between the 75th percentile (2,500 dollars) and the maximum value (10,000 dollars) indicates that there might be outliers in the variable.
Education: The average education level of the customers is Graduate. The mean (1.88) and the median (2) are close to each other; since Education is an ordinal code from 1 to 3, this is only a rough summary.
Mortgage: The median is zero, so at least half of the customers in the data have no mortgage. The large gap between the 75th percentile (101K) and the maximum value (635K) indicates that there might be outliers in the variable.
Personal_Loan: This is our dependent variable. The 75th percentile is 0 and the mean is 0.096, so only about 9.6% of customers in the dataset took a personal loan from this bank in the last campaign.
Securities_Account: The 75th percentile is 0 and the mean is 0.104, so only about 10% of customers in our dataset hold a securities account with the bank.
CD_Account: The 75th percentile is 0 and the mean is 0.060, so only about 6% of customers in our dataset hold a CD account with the bank.
Online: About 60% of the customers (mean 0.597) have an online account with this bank.
CreditCard: About 29% of the customers (mean 0.294) also hold a credit card from other banks.
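The observations above infer possible outliers from the gap between the 75th percentile and the maximum. One common way to make that check explicit is the 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_mask`, the multiplier `k`, and the toy values are assumptions for illustration; the notebook itself does not state which outlier rule it uses.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy income values in thousands of dollars (illustrative only)
toy_income = pd.Series([39, 45, 64, 70, 98, 100, 224], name="Income")

mask = iqr_outlier_mask(toy_income)
outliers = toy_income[mask].tolist()
print(outliers)
```

Flagging a value this way does not mean it should be removed; a very high income can be a genuine customer and may be exactly the segment of interest.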
# Assigning the dataframe columns to a variable
num_columns = loan_data.describe(include = 'all').columns
num_columns
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard'],
dtype='object')
# Print the frequency of each unique value, column by column
for i in num_columns:
    print('Unique values in', i, 'are :')
    print(loan_data[i].value_counts())
    print('*' * 50)
Unique values in ID are :
2047 1
2608 1
4647 1
2600 1
553 1
..
3263 1
1218 1
3267 1
1222 1
2049 1
Name: ID, Length: 5000, dtype: int64
**************************************************
Unique values in Age are :
35 151
43 149
52 145
58 143
54 143
50 138
41 136
30 136
56 135
34 134
39 133
59 132
57 132
51 129
60 127
45 127
46 127
42 126
40 125
31 125
55 125
62 123
29 123
61 122
44 121
32 120
33 120
48 118
38 115
49 115
47 113
53 112
63 108
36 107
37 106
28 103
27 91
65 80
64 78
26 78
25 53
24 28
66 24
23 12
67 12
Name: Age, dtype: int64
**************************************************
Unique values in Experience are :
32 154
20 148
9 147
5 146
23 144
35 143
25 142
28 138
18 137
19 135
26 134
24 131
3 129
14 127
16 127
30 126
34 125
27 125
17 125
29 124
22 124
7 121
8 119
6 119
15 119
10 118
33 117
13 117
11 116
37 116
36 114
21 113
4 113
31 104
12 102
38 88
39 85
2 85
1 74
0 66
40 57
41 43
-1 33
-2 15
42 8
-3 4
43 3
Name: Experience, dtype: int64
**************************************************
Unique values in Income are :
44 85
38 84
81 83
41 82
39 81
40 78
42 77
83 74
43 70
45 69
29 67
21 65
35 65
22 65
85 65
25 64
84 63
28 63
30 63
55 61
82 61
78 61
65 60
64 60
32 58
61 57
53 57
80 56
58 55
62 55
31 55
23 54
34 53
18 53
59 53
79 53
54 52
19 52
49 52
60 52
33 51
70 47
52 47
20 47
24 47
75 47
69 46
63 46
50 45
74 45
48 44
73 44
71 43
51 41
72 41
90 38
91 37
93 37
68 35
113 34
89 34
15 33
13 32
14 31
12 30
114 30
92 29
98 28
115 27
11 27
94 26
9 26
112 26
88 26
95 25
141 24
101 24
99 24
128 24
122 24
125 23
129 23
145 23
8 23
10 23
111 22
154 21
134 20
104 20
149 20
105 20
121 20
140 19
130 19
131 19
118 19
110 19
155 19
119 18
123 18
138 18
135 18
180 18
103 18
158 18
132 18
109 18
120 17
179 17
102 16
108 16
139 16
161 16
195 15
152 15
133 15
142 15
191 13
173 13
182 13
164 13
184 12
170 12
124 12
160 12
183 12
175 12
190 11
172 11
150 11
165 11
148 11
153 11
100 10
162 10
188 10
178 10
163 9
143 9
185 9
174 9
171 9
181 8
194 8
168 8
144 7
169 7
159 7
193 6
192 6
201 5
151 4
200 3
198 3
204 3
199 3
203 2
189 2
202 2
205 2
224 1
218 1
Name: Income, dtype: int64
**************************************************
Unique values in ZIPCode are :
94720 169
94305 127
95616 116
90095 71
93106 57
...
94970 1
92694 1
94404 1
94598 1
94965 1
Name: ZIPCode, Length: 467, dtype: int64
**************************************************
Unique values in Family are :
1 1472
2 1296
4 1222
3 1010
Name: Family, dtype: int64
**************************************************
Unique values in CCAvg are :
0.30 241
1.00 231
0.20 204
2.00 188
0.80 187
0.10 183
0.40 179
1.50 178
0.70 169
0.50 163
1.70 158
1.80 152
1.40 136
2.20 130
1.30 128
0.60 118
2.80 110
2.50 107
0.90 106
0.00 106
1.90 106
1.60 101
2.10 100
2.40 92
2.60 87
1.10 84
1.20 66
2.70 58
2.30 58
2.90 54
3.00 53
3.30 45
3.80 43
3.40 39
2.67 36
4.00 33
4.50 29
3.90 27
3.60 27
4.30 26
6.00 26
3.70 25
4.70 24
3.20 22
4.10 22
4.90 22
3.10 20
6.50 18
5.00 18
5.40 18
0.67 18
2.33 18
1.67 18
4.40 17
5.20 16
3.50 15
6.90 14
7.00 14
6.10 14
4.60 14
7.20 13
5.70 13
7.40 13
6.30 13
7.50 12
8.00 12
4.20 11
6.33 10
6.80 10
8.10 10
7.30 10
0.75 9
1.75 9
6.67 9
4.33 9
7.60 9
6.70 9
1.33 9
8.80 9
7.80 9
8.60 8
4.80 7
5.60 7
5.10 6
5.90 5
7.90 4
5.30 4
6.60 4
5.50 4
5.80 3
10.00 3
6.40 3
4.75 2
8.50 2
4.25 2
8.30 2
5.67 2
6.20 2
9.00 2
3.33 1
8.90 1
4.67 1
3.25 1
2.75 1
8.20 1
9.30 1
3.67 1
5.33 1
Name: CCAvg, dtype: int64
**************************************************
Unique values in Education are :
1 2096
3 1501
2 1403
Name: Education, dtype: int64
**************************************************
Unique values in Mortgage are :
0 3462
98 17
103 16
119 16
83 16
...
541 1
509 1
505 1
485 1
577 1
Name: Mortgage, Length: 347, dtype: int64
**************************************************
Unique values in Personal_Loan are :
0 4520
1 480
Name: Personal_Loan, dtype: int64
**************************************************
Unique values in Securities_Account are :
0 4478
1 522
Name: Securities_Account, dtype: int64
**************************************************
Unique values in CD_Account are :
0 4698
1 302
Name: CD_Account, dtype: int64
**************************************************
Unique values in Online are :
1 2984
0 2016
Name: Online, dtype: int64
**************************************************
Unique values in CreditCard are :
0 3530
1 1470
Name: CreditCard, dtype: int64
**************************************************
The Experience column has values of -1, -2 and -3 that need to be treated.
There are no null or special-character values in any of the columns.
However, some of the columns contain zero values.
In the Experience column, zero values indicate that the dataset also includes customers with no work experience.
Zero values in Mortgage indicate customers who have no mortgage.
Zero values in CCAvg indicate that some customers do not spend on credit cards.
Personal_Loan, Securities_Account, CD_Account, Online and CreditCard are categorical columns in which 0 (False) and 1 (True) stand for No and Yes. We will convert the data types of these columns soon.
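As a minimal sketch of the conversion mentioned above, 0/1 columns can be cast to the pandas `category` dtype with `DataFrame.astype`. The toy frame and the list name `binary_cols` are illustrative assumptions; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with two 0/1 flag columns and one numeric column (illustrative values)
toy = pd.DataFrame(
    {"Online": [1, 0, 1, 1], "CreditCard": [0, 0, 1, 0], "Income": [49, 34, 11, 100]}
)

# Columns that encode Yes/No as 1/0
binary_cols = ["Online", "CreditCard"]

# Cast only the flag columns; other columns keep their numeric dtype
toy = toy.astype({col: "category" for col in binary_cols})
print(toy.dtypes)
```

Passing a dictionary to `astype` converts just the listed columns and returns a new frame, which avoids dtype surprises from assigning back through `.loc`.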
#Display the rows with negative values in the Experience column
print(loan_data[loan_data.Experience < 0 ])
ID Age Experience Income ZIPCode Family CCAvg Education \
89 90 25 -1 113 94303 4 2.30 3
226 227 24 -1 39 94085 2 1.70 2
315 316 24 -2 51 90630 3 0.30 3
451 452 28 -2 48 94132 2 1.75 3
524 525 24 -1 75 93014 4 0.20 1
536 537 25 -1 43 92173 3 2.40 2
540 541 25 -1 109 94010 4 2.30 3
576 577 25 -1 48 92870 3 0.30 3
583 584 24 -1 38 95045 2 1.70 2
597 598 24 -2 125 92835 2 7.20 1
649 650 25 -1 82 92677 4 2.10 3
670 671 23 -1 61 92374 4 2.60 1
686 687 24 -1 38 92612 4 0.60 2
793 794 24 -2 150 94720 2 2.00 1
889 890 24 -2 82 91103 2 1.60 3
909 910 23 -1 149 91709 1 6.33 1
1173 1174 24 -1 35 94305 2 1.70 2
1428 1429 25 -1 21 94583 4 0.40 1
1522 1523 25 -1 101 94720 4 2.30 3
1905 1906 25 -1 112 92507 2 2.00 1
2102 2103 25 -1 81 92647 2 1.60 3
2430 2431 23 -1 73 92120 4 2.60 1
2466 2467 24 -2 80 94105 2 1.60 3
2545 2546 25 -1 39 94720 3 2.40 2
2618 2619 23 -3 55 92704 3 2.40 2
2717 2718 23 -2 45 95422 4 0.60 2
2848 2849 24 -1 78 94720 2 1.80 2
2876 2877 24 -2 80 91107 2 1.60 3
2962 2963 23 -2 81 91711 2 1.80 2
2980 2981 25 -1 53 94305 3 2.40 2
3076 3077 29 -1 62 92672 2 1.75 3
3130 3131 23 -2 82 92152 2 1.80 2
3157 3158 23 -1 13 94720 4 1.00 1
3279 3280 26 -1 44 94901 1 2.00 2
3284 3285 25 -1 101 95819 4 2.10 3
3292 3293 25 -1 13 95616 4 0.40 1
3394 3395 25 -1 113 90089 4 2.10 3
3425 3426 23 -1 12 91605 4 1.00 1
3626 3627 24 -3 28 90089 4 1.00 3
3796 3797 24 -2 50 94920 3 2.40 2
3824 3825 23 -1 12 95064 4 1.00 1
3887 3888 24 -2 118 92634 2 7.20 1
3946 3947 25 -1 40 93117 3 2.40 2
4015 4016 25 -1 139 93106 2 2.00 1
4088 4089 29 -1 71 94801 2 1.75 3
4116 4117 24 -2 135 90065 2 7.20 1
4285 4286 23 -3 149 93555 2 7.20 1
4411 4412 23 -2 75 90291 2 1.80 2
4481 4482 25 -2 35 95045 4 1.00 3
4514 4515 24 -3 41 91768 4 1.00 3
4582 4583 25 -1 69 92691 3 0.30 3
4957 4958 29 -1 50 95842 2 1.75 3
Mortgage Personal_Loan Securities_Account CD_Account Online \
89 0 0 0 0 0
226 0 0 0 0 0
315 0 0 0 0 1
451 89 0 0 0 1
524 0 0 0 0 1
536 176 0 0 0 1
540 314 0 0 0 1
576 0 0 0 0 0
583 0 0 0 0 1
597 0 0 1 0 0
649 0 0 0 0 1
670 239 0 0 0 1
686 0 0 0 0 1
793 0 0 0 0 1
889 0 0 0 0 1
909 305 0 0 0 0
1173 0 0 0 0 0
1428 90 0 0 0 1
1522 256 0 0 0 0
1905 241 0 0 0 1
2102 0 0 0 0 1
2430 0 0 0 0 1
2466 0 0 0 0 1
2545 0 0 0 0 1
2618 145 0 0 0 1
2717 0 0 0 0 1
2848 0 0 0 0 0
2876 238 0 0 0 0
2962 0 0 0 0 0
2980 0 0 0 0 0
3076 0 0 0 0 0
3130 0 0 1 0 0
3157 84 0 0 0 1
3279 0 0 0 0 0
3284 0 0 0 0 0
3292 0 0 1 0 0
3394 0 0 0 0 1
3425 90 0 0 0 1
3626 0 0 0 0 0
3796 0 0 1 0 0
3824 0 0 1 0 0
3887 0 0 1 0 1
3946 0 0 0 0 1
4015 0 0 0 0 0
4088 0 0 0 0 0
4116 0 0 0 0 1
4285 0 0 0 0 1
4411 0 0 0 0 1
4481 0 0 0 0 1
4514 0 0 0 0 1
4582 0 0 0 0 1
4957 0 0 0 0 0
CreditCard
89 1
226 0
315 0
451 0
524 0
536 0
540 0
576 1
583 0
597 1
649 0
670 0
686 0
793 0
889 1
909 1
1173 0
1428 0
1522 1
1905 0
2102 1
2430 0
2466 0
2545 0
2618 0
2717 1
2848 0
2876 0
2962 0
2980 0
3076 1
3130 1
3157 0
3279 0
3284 1
3292 0
3394 0
3425 0
3626 0
3796 0
3824 1
3887 0
3946 0
4015 1
4088 0
4116 0
4285 0
4411 1
4481 0
4514 0
4582 0
4957 1
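The listing above shows the affected rows; the value counts earlier give 33 + 15 + 4 = 52 negative entries. A boolean mask can produce that count directly, as in this small sketch on toy data (the frame `toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy frame with two impossible (negative) experience values (illustrative only)
toy = pd.DataFrame({"Age": [23, 24, 25, 40], "Experience": [-1, -2, 1, 15]})

# True where Experience is negative
neg_mask = toy["Experience"] < 0

# Total number of negative entries
n_negative = int(neg_mask.sum())

# How often each negative value occurs
per_value = toy.loc[neg_mask, "Experience"].value_counts().sort_index()

print(n_negative)
print(per_value)
```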
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5000 non-null int64
1 Age 5000 non-null int64
2 Experience 5000 non-null int64
3 Income 5000 non-null int64
4 ZIPCode 5000 non-null int64
5 Family 5000 non-null int64
6 CCAvg 5000 non-null float64
7 Education 5000 non-null int64
8 Mortgage 5000 non-null int64
9 Personal_Loan 5000 non-null int64
10 Securities_Account 5000 non-null int64
11 CD_Account 5000 non-null int64
12 Online 5000 non-null int64
13 CreditCard 5000 non-null int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
#Treat the negative values in Experience column
#Mask negative values as NaN (where() keeps valid values and upcasts the column to float64)
loan_data['Experience'] = loan_data['Experience'].where(loan_data['Experience'] >= 0)
#Fill each NaN with the median Experience of customers of the same Age
#transform() returns a result aligned to the original index, so it can be assigned back directly
loan_data['Experience'] = loan_data.groupby('Age')['Experience'].transform(lambda x: x.fillna(x.median()))
print(loan_data['Experience'].value_counts())
32.0 154
20.0 148
9.0 147
5.0 146
23.0 144
35.0 143
25.0 142
28.0 138
18.0 137
19.0 135
26.0 134
24.0 131
3.0 130
14.0 127
16.0 127
30.0 126
17.0 125
34.0 125
27.0 125
29.0 124
22.0 124
7.0 121
6.0 119
8.0 119
15.0 119
10.0 118
13.0 117
33.0 117
37.0 116
4.0 116
11.0 116
36.0 114
21.0 113
31.0 104
12.0 102
1.0 93
38.0 88
2.0 85
39.0 85
0.0 83
40.0 57
41.0 43
42.0 8
43.0 3
Name: Experience, dtype: int64
The negative values in the Experience column are first converted to NaN and then filled with the median Experience of customers of the same Age, which gives the closest plausible value.
The np.nan conversion changed the Experience column datatype to float64.
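A minimal sketch of this group-wise median imputation on toy data follows. It also illustrates one limitation: a group in which every value is missing has no median, so its rows stay NaN and need a separate fallback. The toy frame, the variable names, and the zero fallback are assumptions for illustration.

```python
import pandas as pd

# Toy data (illustrative values, not the notebook's dataset)
toy = pd.DataFrame(
    {
        "Age": [23, 23, 24, 24, 24],
        "Experience": [-1.0, -2.0, 0.0, -1.0, 1.0],
    }
)

# Step 1: mask impossible (negative) values as NaN
toy["Experience"] = toy["Experience"].where(toy["Experience"] >= 0)

# Step 2: fill each NaN with the median Experience of the same Age group
group_median = toy.groupby("Age")["Experience"].transform("median")
toy["Experience"] = toy["Experience"].fillna(group_median)

# Age 23 has no valid values at all, so its group median is NaN
still_missing = int(toy["Experience"].isna().sum())

# Step 3 (assumed fallback): treat the remaining gaps as zero experience
toy["Experience"] = toy["Experience"].fillna(0)

print(still_missing)
print(toy["Experience"].tolist())
```

Checking the number of remaining NaN values after step 2 is a cheap way to catch groups that could not be imputed before moving on.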
loan_data.info()
print(loan_data[loan_data['Experience'].isnull()])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5000 non-null int64
1 Age 5000 non-null int64
2 Experience 4988 non-null float64
3 Income 5000 non-null int64
4 ZIPCode 5000 non-null int64
5 Family 5000 non-null int64
6 CCAvg 5000 non-null float64
7 Education 5000 non-null int64
8 Mortgage 5000 non-null int64
9 Personal_Loan 5000 non-null int64
10 Securities_Account 5000 non-null int64
11 CD_Account 5000 non-null int64
12 Online 5000 non-null int64
13 CreditCard 5000 non-null int64
dtypes: float64(2), int64(12)
memory usage: 547.0 KB
ID Age Experience Income ZIPCode Family CCAvg Education \
670 671 23 NaN 61 92374 4 2.60 1
909 910 23 NaN 149 91709 1 6.33 1
2430 2431 23 NaN 73 92120 4 2.60 1
2618 2619 23 NaN 55 92704 3 2.40 2
2717 2718 23 NaN 45 95422 4 0.60 2
2962 2963 23 NaN 81 91711 2 1.80 2
3130 3131 23 NaN 82 92152 2 1.80 2
3157 3158 23 NaN 13 94720 4 1.00 1
3425 3426 23 NaN 12 91605 4 1.00 1
3824 3825 23 NaN 12 95064 4 1.00 1
4285 4286 23 NaN 149 93555 2 7.20 1
4411 4412 23 NaN 75 90291 2 1.80 2
Mortgage Personal_Loan Securities_Account CD_Account Online \
670 239 0 0 0 1
909 305 0 0 0 0
2430 0 0 0 0 1
2618 145 0 0 0 1
2717 0 0 0 0 1
2962 0 0 0 0 0
3130 0 0 1 0 0
3157 84 0 0 0 1
3425 90 0 0 0 1
3824 0 0 1 0 0
4285 0 0 0 0 1
4411 0 0 0 0 1
CreditCard
670 0
909 1
2430 0
2618 0
2717 1
2962 0
3130 1
3157 0
3425 0
3824 1
4285 0
4411 1
print(loan_data[loan_data.Age == 24])
ID Age Experience Income ZIPCode Family CCAvg Education \
105 106 24 0.0 35 94704 3 0.1 2
155 156 24 0.0 60 94596 4 1.6 1
182 183 24 0.0 135 95133 1 1.5 1
226 227 24 0.0 39 94085 2 1.7 2
315 316 24 0.0 51 90630 3 0.3 3
524 525 24 0.0 75 93014 4 0.2 1
583 584 24 0.0 38 95045 2 1.7 2
597 598 24 0.0 125 92835 2 7.2 1
686 687 24 0.0 38 92612 4 0.6 2
793 794 24 0.0 150 94720 2 2.0 1
873 874 24 0.0 88 90740 3 0.8 1
889 890 24 0.0 82 91103 2 1.6 3
1173 1174 24 0.0 35 94305 2 1.7 2
2259 2260 24 0.0 82 90401 3 0.8 1
2466 2467 24 0.0 80 94105 2 1.6 3
2652 2653 24 0.0 44 90089 4 1.6 1
2848 2849 24 0.0 78 94720 2 1.8 2
2876 2877 24 0.0 80 91107 2 1.6 3
3626 3627 24 0.0 28 90089 4 1.0 3
3796 3797 24 0.0 50 94920 3 2.4 2
3887 3888 24 0.0 118 92634 2 7.2 1
3908 3909 24 0.0 44 90638 3 0.1 2
3982 3983 24 0.0 119 94566 1 1.5 1
4116 4117 24 0.0 135 90065 2 7.2 1
4393 4394 24 0.0 59 95521 4 1.6 1
4514 4515 24 0.0 41 91768 4 1.0 3
4566 4567 24 0.0 131 92831 1 5.4 1
4989 4990 24 0.0 38 93555 1 1.0 3
Mortgage Personal_Loan Securities_Account CD_Account Online \
105 0 0 1 0 1
155 0 0 0 0 1
182 0 0 0 0 1
226 0 0 0 0 0
315 0 0 0 0 1
524 0 0 0 0 1
583 0 0 0 0 1
597 0 0 1 0 0
686 0 0 0 0 1
793 0 0 0 0 1
873 134 0 0 0 0
889 0 0 0 0 1
1173 0 0 0 0 0
2259 0 0 0 0 1
2466 0 0 0 0 1
2652 180 0 0 0 1
2848 0 0 0 0 0
2876 238 0 0 0 0
3626 0 0 0 0 0
3796 0 0 1 0 0
3887 0 0 1 0 1
3908 0 0 0 0 0
3982 0 0 0 0 1
4116 0 0 0 0 1
4393 0 0 0 0 0
4514 0 0 0 0 1
4566 0 0 0 0 1
4989 0 0 0 0 1
CreditCard
105 0
155 0
182 0
226 0
315 0
524 0
583 0
597 1
686 0
793 0
873 0
889 1
1173 0
2259 0
2466 0
2652 0
2848 0
2876 0
3626 0
3796 0
3887 0
3908 0
3982 0
4116 0
4393 0
4514 0
4566 0
4989 0
print(loan_data[loan_data.Age == 23])
ID Age Experience Income ZIPCode Family CCAvg Education \
670 671 23 NaN 61 92374 4 2.60 1
909 910 23 NaN 149 91709 1 6.33 1
2430 2431 23 NaN 73 92120 4 2.60 1
2618 2619 23 NaN 55 92704 3 2.40 2
2717 2718 23 NaN 45 95422 4 0.60 2
2962 2963 23 NaN 81 91711 2 1.80 2
3130 3131 23 NaN 82 92152 2 1.80 2
3157 3158 23 NaN 13 94720 4 1.00 1
3425 3426 23 NaN 12 91605 4 1.00 1
3824 3825 23 NaN 12 95064 4 1.00 1
4285 4286 23 NaN 149 93555 2 7.20 1
4411 4412 23 NaN 75 90291 2 1.80 2
Mortgage Personal_Loan Securities_Account CD_Account Online \
670 239 0 0 0 1
909 305 0 0 0 0
2430 0 0 0 0 1
2618 145 0 0 0 1
2717 0 0 0 0 1
2962 0 0 0 0 0
3130 0 0 1 0 0
3157 84 0 0 0 1
3425 90 0 0 0 1
3824 0 0 1 0 0
4285 0 0 0 0 1
4411 0 0 0 0 1
CreditCard
670 0
909 1
2430 0
2618 0
2717 1
2962 0
3130 1
3157 0
3425 0
3824 1
4285 0
4411 1
print(loan_data[loan_data.Age == 25])
ID Age Experience Income ZIPCode Family CCAvg Education \
0 1 25 1.0 49 91107 4 1.60 1
89 90 25 1.0 113 94303 4 2.30 3
143 144 25 1.0 54 94117 4 1.60 1
166 167 25 1.0 21 95827 3 1.00 2
347 348 25 0.0 43 94305 2 1.60 3
363 364 25 0.0 30 92691 2 1.70 2
379 380 25 0.0 28 92093 2 1.70 2
466 467 25 0.0 13 91342 2 0.90 3
484 485 25 1.0 113 95023 2 0.20 1
495 496 25 0.0 44 94545 4 0.60 2
536 537 25 1.0 43 92173 3 2.40 2
540 541 25 1.0 109 94010 4 2.30 3
576 577 25 1.0 48 92870 3 0.30 3
649 650 25 1.0 82 92677 4 2.10 3
1003 1004 25 1.0 62 94720 4 0.00 1
1065 1066 25 1.0 113 90401 3 2.50 1
1092 1093 25 1.0 70 92120 4 2.60 1
1181 1182 25 0.0 65 90095 4 0.20 1
1428 1429 25 1.0 21 94583 4 0.40 1
1522 1523 25 1.0 101 94720 4 2.30 3
1732 1733 25 0.0 88 94566 2 1.80 2
1847 1848 25 0.0 52 95126 3 2.60 3
1868 1869 25 1.0 118 92833 1 5.40 1
1905 1906 25 1.0 112 92507 2 2.00 1
2009 2010 25 0.0 99 92735 1 1.90 1
2102 2103 25 1.0 81 92647 2 1.60 3
2157 2158 25 0.0 71 93727 4 0.20 1
2192 2193 25 1.0 13 95814 4 1.00 1
2226 2227 25 1.0 98 90717 1 5.40 1
2417 2418 25 0.0 53 90095 2 1.60 3
2446 2447 25 1.0 70 93010 4 2.60 1
2452 2453 25 1.0 28 94596 1 1.00 3
2545 2546 25 1.0 39 94720 3 2.40 2
2836 2837 25 1.0 74 94085 4 2.60 1
2980 2981 25 1.0 53 94305 3 2.40 2
3010 3011 25 1.0 72 94301 3 0.80 1
3135 3136 25 0.0 91 95039 2 1.80 2
3284 3285 25 1.0 101 95819 4 2.10 3
3292 3293 25 1.0 13 95616 4 0.40 1
3378 3379 25 0.0 44 94536 4 0.60 2
3394 3395 25 1.0 113 90089 4 2.10 3
3486 3487 25 1.0 20 92806 4 1.00 1
3870 3871 25 0.0 25 94596 2 0.90 3
3946 3947 25 1.0 40 93117 3 2.40 2
4015 4016 25 1.0 139 93106 2 2.00 1
4046 4047 25 0.0 72 94303 3 2.60 3
4271 4272 25 1.0 150 92507 1 6.33 1
4481 4482 25 1.0 35 95045 4 1.00 3
4582 4583 25 1.0 69 92691 3 0.30 3
4677 4678 25 0.0 38 93407 2 1.60 3
4712 4713 25 0.0 14 94309 2 0.90 3
4713 4714 25 1.0 122 93022 2 0.20 1
4888 4889 25 1.0 121 93106 1 5.40 1
Mortgage Personal_Loan Securities_Account CD_Account Online \
0 0 0 1 0 0
89 0 0 0 0 0
143 0 0 0 0 1
166 0 0 0 0 0
347 0 0 1 1 1
363 0 0 0 0 0
379 0 0 0 0 0
466 0 0 0 0 1
484 0 0 0 0 1
495 0 0 0 0 1
536 176 0 0 0 1
540 314 0 0 0 1
576 0 0 0 0 0
649 0 0 0 0 1
1003 229 0 0 0 1
1065 0 0 0 0 0
1092 0 0 1 0 1
1181 0 0 1 0 0
1428 90 0 0 0 1
1522 256 0 0 0 0
1732 319 0 0 0 1
1847 159 0 0 0 0
1868 0 0 0 0 1
1905 241 0 0 0 1
2009 323 0 0 0 0
2102 0 0 0 0 1
2157 78 0 1 0 0
2192 95 0 0 0 0
2226 0 0 0 0 1
2417 0 0 0 0 1
2446 218 0 0 0 1
2452 0 0 0 0 1
2545 0 0 0 0 1
2836 204 0 0 0 0
2980 0 0 0 0 0
3010 0 0 0 0 1
3135 321 0 0 0 0
3284 0 0 0 0 0
3292 0 0 1 0 0
3378 0 0 0 0 0
3394 0 0 0 0 1
3486 0 0 0 0 0
3870 0 0 0 0 0
3946 0 0 0 0 1
4015 0 0 0 0 0
4046 0 0 0 0 1
4271 0 0 0 0 0
4481 0 0 0 0 1
4582 0 0 0 0 1
4677 0 0 0 0 0
4712 0 0 0 0 0
4713 0 0 0 0 1
4888 158 0 0 0 1
CreditCard
0 0
89 1
143 1
166 0
347 1
363 0
379 0
466 0
484 1
495 1
536 0
540 0
576 1
649 0
1003 0
1065 1
1092 0
1181 0
1428 0
1522 1
1732 1
1847 0
1868 1
1905 0
2009 0
2102 1
2157 0
2192 1
2226 0
2417 1
2446 0
2452 0
2545 0
2836 0
2980 0
3010 0
3135 0
3284 1
3292 0
3378 1
3394 0
3486 1
3870 0
3946 0
4015 1
4046 0
4271 0
4481 0
4582 0
4677 0
4712 1
4713 0
4888 0
print(loan_data[loan_data.Age == 29])
ID Age Experience Income ZIPCode Family CCAvg Education \
11 12 29 5.0 45 90277 3 0.10 2
22 23 29 5.0 62 90277 1 1.20 1
54 55 29 5.0 44 95819 1 0.20 3
160 161 29 0.0 134 95819 4 6.50 3
177 178 29 3.0 65 94132 4 1.80 2
183 184 29 3.0 148 92173 3 4.10 1
272 273 29 3.0 45 95023 4 0.20 1
277 278 29 2.0 30 92126 4 1.00 3
338 339 29 3.0 153 93657 2 2.00 1
401 402 29 2.0 30 95747 4 1.50 2
457 458 29 3.0 69 94303 3 0.30 3
462 463 29 4.0 183 91423 3 8.30 3
483 484 29 5.0 30 90095 3 1.00 2
574 575 29 5.0 80 94709 2 2.00 2
590 591 29 3.0 39 94612 4 2.10 3
602 603 29 5.0 135 95035 2 0.60 1
675 676 29 2.0 33 91711 1 2.00 2
695 696 29 4.0 115 92717 1 1.90 1
709 710 29 4.0 72 95841 4 2.20 1
716 717 29 5.0 31 96064 4 0.40 2
750 751 29 5.0 138 93106 2 4.33 1
760 761 29 3.0 52 92122 3 1.10 2
789 790 29 3.0 31 92126 4 0.30 2
798 799 29 2.0 38 93063 1 2.00 2
799 800 29 3.0 39 95051 4 2.10 3
830 831 29 5.0 72 92407 3 0.70 2
894 895 29 4.0 59 95064 4 2.20 1
906 907 29 3.0 154 94720 2 2.00 1
1019 1020 29 3.0 30 91745 4 0.30 2
1028 1029 29 4.0 110 92096 4 2.50 3
1077 1078 29 3.0 175 90095 3 3.30 3
1102 1103 29 3.0 84 95023 1 2.90 3
1175 1176 29 4.0 58 91006 1 0.80 2
1176 1177 29 3.0 103 90049 4 3.40 1
1191 1192 29 5.0 128 94111 1 1.50 1
1194 1195 29 3.0 41 94305 4 1.30 3
1199 1200 29 4.0 62 92064 2 2.50 1
1242 1243 29 4.0 44 91380 4 2.00 2
1286 1287 29 3.0 50 94010 3 1.10 2
1303 1304 29 5.0 112 94720 2 2.00 2
1350 1351 29 2.0 29 90266 4 1.50 2
1390 1391 29 3.0 80 94305 4 1.80 2
1424 1425 29 3.0 92 94539 2 1.30 1
1446 1447 29 4.0 22 92661 2 0.90 3
1453 1454 29 5.0 85 90232 3 2.50 1
1539 1540 29 5.0 21 90601 3 0.90 3
1552 1553 29 5.0 195 94301 1 4.30 1
1579 1580 29 5.0 122 94305 4 3.00 1
1588 1589 29 3.0 55 95616 3 1.10 2
1618 1619 29 3.0 29 94720 3 1.00 1
1649 1650 29 4.0 73 95039 1 0.80 2
1673 1674 29 5.0 81 94115 2 2.50 1
1701 1702 29 3.0 108 94304 4 1.80 2
1747 1748 29 5.0 21 90717 4 0.40 2
1785 1786 29 3.0 190 94080 2 4.50 1
1802 1803 29 3.0 121 92806 2 1.30 1
1957 1958 29 4.0 121 90028 2 3.30 1
1975 1976 29 3.0 113 94132 2 0.20 1
2065 2066 29 5.0 83 92354 3 1.50 1
2072 2073 29 3.0 39 95831 4 0.20 1
2188 2189 29 4.0 9 92037 4 0.50 3
2276 2277 29 3.0 172 92093 4 4.40 1
2410 2411 29 4.0 130 92630 2 6.70 1
2427 2428 29 5.0 34 92675 4 0.40 2
2489 2490 29 3.0 41 92626 4 0.20 1
2529 2530 29 5.0 44 95819 3 0.10 2
2641 2642 29 5.0 133 90095 1 5.40 1
2731 2732 29 5.0 28 96651 1 0.20 3
2741 2742 29 3.0 49 90266 1 1.50 1
2820 2821 29 4.0 102 90245 2 3.30 1
2863 2864 29 5.0 70 93101 4 0.00 1
2942 2943 29 5.0 160 90405 1 4.30 1
2963 2964 29 3.0 41 94588 1 1.90 3
3012 3013 29 3.0 172 92373 2 4.50 1
3041 3042 29 5.0 92 95006 2 0.60 1
3073 3074 29 5.0 149 94611 1 1.50 1
3076 3077 29 4.0 62 92672 2 1.75 3
3093 3094 29 5.0 34 90717 4 0.40 2
3114 3115 29 4.0 55 90024 4 2.00 2
3166 3167 29 4.0 80 90028 1 0.80 2
3340 3341 29 3.0 54 94104 4 1.80 3
3390 3391 29 3.0 73 94720 3 0.30 3
3409 3410 29 5.0 113 95351 2 2.00 2
3450 3451 29 4.0 14 94590 4 0.50 3
3453 3454 29 3.0 31 94709 4 0.30 2
3487 3488 29 4.0 104 91711 4 1.80 3
3494 3495 29 2.0 31 91330 4 1.50 2
3503 3504 29 3.0 53 95814 4 2.10 3
3523 3524 29 4.0 150 91302 1 0.80 1
3578 3579 29 5.0 128 91302 2 4.10 2
3609 3610 29 5.0 162 94022 1 4.30 1
3661 3662 29 4.0 120 94553 1 4.10 2
3715 3716 29 5.0 124 92037 2 0.20 1
3769 3770 29 4.0 134 90095 2 3.30 1
3805 3806 29 5.0 84 93109 3 0.80 1
3877 3878 29 4.0 41 93105 1 1.00 1
3904 3905 29 5.0 18 94122 1 0.40 3
3945 3946 29 3.0 123 92821 3 5.60 3
3962 3963 29 5.0 31 93014 1 1.00 3
3972 3973 29 5.0 112 94998 2 4.33 1
4042 4043 29 3.0 190 92612 2 4.50 1
4088 4089 29 4.0 71 94801 2 1.75 3
4129 4130 29 3.0 10 91320 4 0.40 1
4139 4140 29 3.0 81 95827 1 2.90 3
4179 4180 29 3.0 91 94122 1 3.40 3
4404 4405 29 5.0 34 94301 1 0.40 3
4413 4414 29 2.0 31 91775 4 1.50 2
4456 4457 29 3.0 35 94040 2 0.30 1
4494 4495 29 4.0 182 95354 1 3.70 3
4515 4516 29 3.0 49 94305 4 2.10 3
4523 4524 29 4.0 50 94040 4 1.70 2
4688 4689 29 3.0 69 92093 4 1.80 2
4717 4718 29 5.0 121 95449 1 1.50 1
4812 4813 29 4.0 184 92126 4 2.20 3
4832 4833 29 4.0 83 91950 4 2.20 2
4916 4917 29 5.0 123 90291 2 0.60 1
4949 4950 29 5.0 64 94114 4 0.00 1
4952 4953 29 3.0 53 94005 4 1.80 3
4957 4958 29 4.0 50 95842 2 1.75 3
4965 4966 29 5.0 33 94709 1 1.80 2
4976 4977 29 5.0 31 95039 1 1.80 2
4980 4981 29 5.0 135 95762 3 5.30 1
4995 4996 29 3.0 40 92697 1 1.90 3
Mortgage Personal_Loan Securities_Account CD_Account Online \
11 0 0 0 0 1
22 260 0 0 0 1
54 0 0 0 0 1
160 0 1 0 0 0
177 244 0 0 0 0
183 0 1 0 0 1
272 158 0 0 0 1
277 0 0 0 0 0
338 392 0 0 0 0
401 112 0 0 0 0
457 0 0 0 0 0
462 0 1 0 0 1
483 0 0 0 0 0
574 0 0 0 0 1
590 0 0 0 0 1
602 0 0 0 0 0
675 160 0 0 0 0
695 0 0 0 0 0
709 0 0 0 0 1
716 161 0 0 0 1
750 0 0 0 0 1
760 0 0 0 0 1
789 0 0 0 0 1
798 0 0 0 0 0
799 0 0 0 0 1
830 81 0 0 0 0
894 232 0 0 0 1
906 130 0 0 0 0
1019 157 0 0 0 0
1028 0 1 0 0 0
1077 329 1 0 0 1
1102 0 0 0 0 1
1175 0 0 0 0 1
1176 0 1 0 0 1
1191 0 0 0 0 1
1194 0 0 0 0 1
1199 184 0 0 0 1
1242 0 0 0 0 1
1286 0 0 0 0 0
1303 382 0 1 0 0
1350 0 0 0 0 0
1390 0 0 0 0 1
1424 287 0 0 0 1
1446 110 0 0 0 0
1453 0 0 0 0 1
1539 119 0 0 0 0
1552 0 0 0 0 0
1579 0 1 0 0 0
1588 0 0 0 0 1
1618 0 0 0 0 1
1649 0 0 0 0 1
1673 0 0 0 0 0
1701 0 0 0 0 0
1747 89 0 0 0 0
1785 0 0 0 0 1
1802 0 0 0 0 0
1957 0 0 0 0 1
1975 0 0 0 0 1
2065 0 0 0 0 1
2072 137 0 0 0 1
2188 86 0 0 0 1
2276 0 1 0 0 0
2410 0 0 0 0 0
2427 0 0 0 0 1
2489 0 0 0 0 1
2529 0 0 0 0 1
2641 212 0 0 0 1
2731 0 0 0 0 1
2741 0 0 0 0 0
2820 303 0 0 0 0
2863 0 0 0 0 1
2942 385 0 0 0 1
2963 0 0 0 0 1
3012 415 0 0 0 1
3041 0 0 0 0 1
3073 0 0 0 0 1
3076 0 0 0 0 0
3093 0 0 0 0 0
3114 0 0 1 0 1
3166 0 0 0 0 1
3340 0 0 0 0 0
3390 0 0 0 0 0
3409 84 0 0 0 1
3450 0 0 0 0 0
3453 0 0 0 0 1
3487 0 0 0 0 0
3494 0 0 0 0 0
3503 0 0 0 0 1
3523 0 0 0 0 0
3578 209 1 0 0 1
3609 0 0 0 0 0
3661 0 1 1 1 0
3715 0 0 0 0 0
3769 204 0 0 0 0
3805 0 0 0 0 0
3877 0 0 0 0 0
3904 94 0 0 0 1
3945 428 1 0 0 1
3962 0 0 0 0 0
3972 0 0 0 0 1
4042 246 0 0 0 1
4088 0 0 0 0 0
4129 87 0 0 0 1
4139 0 0 0 0 0
4179 0 1 0 0 0
4404 0 0 0 0 0
4413 0 0 0 0 0
4456 88 0 0 1 1
4494 0 1 0 0 1
4515 0 0 0 0 0
4523 0 0 0 0 1
4688 0 0 0 0 1
4717 0 0 0 0 1
4812 612 1 0 0 1
4832 0 0 0 0 1
4916 0 0 0 0 1
4949 249 0 0 0 0
4952 0 0 0 0 1
4957 0 0 0 0 0
4965 78 0 0 0 1
4976 0 0 0 0 1
4980 0 1 0 1 1
4995 0 0 0 0 1
CreditCard
11 0
22 0
54 0
160 0
177 0
183 0
272 1
277 0
338 0
401 1
457 0
462 0
483 0
574 1
590 0
602 0
675 0
695 0
709 0
716 1
750 0
760 0
789 0
798 0
799 0
830 0
894 1
906 0
1019 0
1028 0
1077 0
1102 0
1175 1
1176 0
1191 1
1194 0
1199 0
1242 0
1286 1
1303 0
1350 1
1390 1
1424 0
1446 0
1453 1
1539 0
1552 0
1579 1
1588 0
1618 1
1649 0
1673 1
1701 0
1747 1
1785 0
1802 0
1957 0
1975 1
2065 1
2072 1
2188 1
2276 0
2410 1
2427 0
2489 0
2529 1
2641 0
2731 0
2741 0
2820 0
2863 1
2942 0
2963 1
3012 0
3041 0
3073 0
3076 1
3093 1
3114 0
3166 1
3340 0
3390 0
3409 1
3450 1
3453 0
3487 1
3494 0
3503 0
3523 1
3578 0
3609 1
3661 1
3715 1
3769 0
3805 0
3877 0
3904 1
3945 0
3962 0
3972 1
4042 1
4088 0
4129 1
4139 0
4179 0
4404 0
4413 1
4456 1
4494 0
4515 0
4523 0
4688 1
4717 0
4812 0
4832 1
4916 0
4949 1
4952 0
4957 1
4965 0
4976 1
4980 1
4995 0
# Replace NaN values in Experience with 0; the check below confirms the result for customers aged 23
loan_data['Experience'] = loan_data['Experience'].fillna(0)
loan_data[loan_data.Age == 23]
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 670 | 671 | 23 | 0.0 | 61 | 92374 | 4 | 2.60 | 1 | 239 | 0 | 0 | 0 | 1 | 0 |
| 909 | 910 | 23 | 0.0 | 149 | 91709 | 1 | 6.33 | 1 | 305 | 0 | 0 | 0 | 0 | 1 |
| 2430 | 2431 | 23 | 0.0 | 73 | 92120 | 4 | 2.60 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2618 | 2619 | 23 | 0.0 | 55 | 92704 | 3 | 2.40 | 2 | 145 | 0 | 0 | 0 | 1 | 0 |
| 2717 | 2718 | 23 | 0.0 | 45 | 95422 | 4 | 0.60 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2962 | 2963 | 23 | 0.0 | 81 | 91711 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3130 | 3131 | 23 | 0.0 | 82 | 92152 | 2 | 1.80 | 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3157 | 3158 | 23 | 0.0 | 13 | 94720 | 4 | 1.00 | 1 | 84 | 0 | 0 | 0 | 1 | 0 |
| 3425 | 3426 | 23 | 0.0 | 12 | 91605 | 4 | 1.00 | 1 | 90 | 0 | 0 | 0 | 1 | 0 |
| 3824 | 3825 | 23 | 0.0 | 12 | 95064 | 4 | 1.00 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4285 | 4286 | 23 | 0.0 | 149 | 93555 | 2 | 7.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4411 | 4412 | 23 | 0.0 | 75 | 90291 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
#Convert the data types of the variables
loan_data['Family'] = loan_data['Family'].astype('category')
loan_data['Education'] = loan_data['Education'].astype('category')
loan_data['Securities_Account'] = loan_data['Securities_Account'].astype('category')
loan_data['CD_Account'] = loan_data['CD_Account'].astype('category')
loan_data['Online'] = loan_data['Online'].astype('category')
loan_data['CreditCard'] = loan_data['CreditCard'].astype('category')
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   float64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   category
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   category
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   category
 11  CD_Account          5000 non-null   category
 12  Online              5000 non-null   category
 13  CreditCard          5000 non-null   category
dtypes: category(6), float64(2), int64(6)
memory usage: 342.6 KB
from uszipcode import SearchEngine
search = SearchEngine()
loan_data['ZIPCode_County'] = loan_data['ZIPCode'].apply(lambda x: search.by_zipcode(x).county)
loan_data.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1.0 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 2 | 45 | 19.0 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 3 | 39 | 15.0 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 4 | 35 | 9.0 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 5 | 35 | 8.0 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
loan_data['ZIPCode_County'].value_counts()
Los Angeles County       1095
San Diego County          568
Santa Clara County        563
Alameda County            500
Orange County             339
San Francisco County      257
San Mateo County          204
Sacramento County         184
Santa Barbara County      154
Yolo County               130
Monterey County           128
Ventura County            114
San Bernardino County     101
Contra Costa County        85
Santa Cruz County          68
Riverside County           56
Marin County               54
Kern County                54
Solano County              33
San Luis Obispo County     33
Humboldt County            32
Sonoma County              28
Fresno County              26
Placer County              24
Butte County               19
Shasta County              18
El Dorado County           17
Stanislaus County          15
San Benito County          14
San Joaquin County         13
Mendocino County            8
Tuolumne County             7
Siskiyou County             7
Lake County                 4
Trinity County              4
Merced County               4
Napa County                 3
Imperial County             3
Name: ZIPCode_County, dtype: int64
loan_data[loan_data.ZIPCode_County.isnull()]
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 106 | 107 | 43 | 17.0 | 69 | 92717 | 4 | 2.90 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | None |
| 172 | 173 | 38 | 13.0 | 171 | 92717 | 2 | 7.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 184 | 185 | 52 | 26.0 | 63 | 92717 | 2 | 1.50 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | None |
| 321 | 322 | 44 | 20.0 | 101 | 92717 | 3 | 4.40 | 2 | 82 | 1 | 0 | 0 | 0 | 0 | None |
| 366 | 367 | 50 | 24.0 | 35 | 92717 | 1 | 0.30 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 384 | 385 | 51 | 25.0 | 21 | 93077 | 4 | 0.60 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | None |
| 468 | 469 | 34 | 10.0 | 21 | 92634 | 1 | 0.50 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 476 | 477 | 60 | 34.0 | 53 | 92717 | 1 | 0.80 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | None |
| 630 | 631 | 32 | 7.0 | 35 | 96651 | 3 | 1.30 | 1 | 108 | 0 | 0 | 0 | 0 | 1 | None |
| 672 | 673 | 51 | 27.0 | 23 | 96651 | 1 | 0.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 695 | 696 | 29 | 4.0 | 115 | 92717 | 1 | 1.90 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | None |
| 721 | 722 | 49 | 24.0 | 39 | 92717 | 1 | 1.40 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 780 | 781 | 32 | 7.0 | 42 | 92634 | 4 | 0.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | None |
| 1099 | 1100 | 30 | 6.0 | 52 | 92717 | 3 | 0.70 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 1189 | 1190 | 42 | 17.0 | 115 | 92717 | 2 | 0.40 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 1426 | 1427 | 37 | 11.0 | 60 | 96651 | 3 | 0.50 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 1483 | 1484 | 58 | 32.0 | 63 | 92717 | 1 | 1.60 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | None |
| 1653 | 1654 | 26 | 1.0 | 24 | 96651 | 2 | 0.90 | 3 | 123 | 0 | 0 | 0 | 0 | 1 | None |
| 1752 | 1753 | 33 | 8.0 | 155 | 92717 | 1 | 7.40 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | None |
| 1844 | 1845 | 65 | 40.0 | 21 | 92717 | 3 | 0.10 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | None |
| 2049 | 2050 | 43 | 18.0 | 94 | 92717 | 4 | 1.10 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 2211 | 2212 | 39 | 14.0 | 31 | 92717 | 2 | 1.40 | 2 | 94 | 0 | 0 | 0 | 1 | 1 | None |
| 2218 | 2219 | 38 | 13.0 | 9 | 92634 | 2 | 0.30 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | None |
| 2428 | 2429 | 39 | 12.0 | 108 | 92717 | 4 | 3.67 | 2 | 301 | 1 | 0 | 0 | 0 | 1 | None |
| 2486 | 2487 | 61 | 36.0 | 130 | 92717 | 1 | 1.30 | 1 | 257 | 0 | 0 | 0 | 0 | 0 | None |
| 2731 | 2732 | 29 | 5.0 | 28 | 96651 | 1 | 0.20 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 2957 | 2958 | 61 | 36.0 | 53 | 92717 | 3 | 0.50 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | None |
| 3525 | 3526 | 59 | 34.0 | 13 | 96651 | 4 | 0.90 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | None |
| 3887 | 3888 | 24 | 0.0 | 118 | 92634 | 2 | 7.20 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | None |
| 4090 | 4091 | 42 | 18.0 | 49 | 92717 | 3 | 2.10 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | None |
| 4276 | 4277 | 50 | 24.0 | 155 | 92717 | 1 | 7.30 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | None |
| 4321 | 4322 | 27 | 0.0 | 34 | 92717 | 1 | 2.00 | 2 | 112 | 0 | 0 | 0 | 0 | 1 | None |
| 4384 | 4385 | 45 | 20.0 | 61 | 92717 | 3 | 2.70 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | None |
| 4392 | 4393 | 52 | 27.0 | 81 | 92634 | 4 | 3.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | None |
# Let's replace the None values in the ZIPCode_County column with 'Unknown'
loan_data['ZIPCode_County'] = loan_data['ZIPCode_County'].fillna('Unknown')
loan_data["ZIPCode_County"].value_counts()
Los Angeles County       1095
San Diego County          568
Santa Clara County        563
Alameda County            500
Orange County             339
San Francisco County      257
San Mateo County          204
Sacramento County         184
Santa Barbara County      154
Yolo County               130
Monterey County           128
Ventura County            114
San Bernardino County     101
Contra Costa County        85
Santa Cruz County          68
Riverside County           56
Marin County               54
Kern County                54
Unknown                    34
San Luis Obispo County     33
Solano County              33
Humboldt County            32
Sonoma County              28
Fresno County              26
Placer County              24
Butte County               19
Shasta County              18
El Dorado County           17
Stanislaus County          15
San Benito County          14
San Joaquin County         13
Mendocino County            8
Tuolumne County             7
Siskiyou County             7
Merced County               4
Trinity County              4
Lake County                 4
Imperial County             3
Napa County                 3
Name: ZIPCode_County, dtype: int64
#Check the unique values in ZIPCode_County column
loan_data['ZIPCode_County'].unique()
array(['Los Angeles County', 'Alameda County', 'San Francisco County',
'San Diego County', 'Monterey County', 'Ventura County',
'Santa Barbara County', 'Marin County', 'Santa Clara County',
'Santa Cruz County', 'San Mateo County', 'Humboldt County',
'Contra Costa County', 'Orange County', 'Sacramento County',
'Yolo County', 'Placer County', 'San Bernardino County',
'San Luis Obispo County', 'Riverside County', 'Kern County',
'Unknown', 'Fresno County', 'Sonoma County', 'El Dorado County',
'San Benito County', 'Butte County', 'Solano County',
'Mendocino County', 'San Joaquin County', 'Imperial County',
'Siskiyou County', 'Merced County', 'Trinity County',
'Stanislaus County', 'Shasta County', 'Tuolumne County',
'Napa County', 'Lake County'], dtype=object)
loan_data['ZIPCode_County'] = loan_data['ZIPCode_County'].astype('category')
# Group Age into AgeRange buckets by adding a new column to the dataframe
loan_data['AgeRange'] = pd.cut(x=loan_data['Age'], bins=[20, 30, 40, 50, 60, 70])
# Group Income into IncomeRange buckets by adding a new column to the dataframe
loan_data['IncomeRange'] = pd.cut(x=loan_data['Income'], bins=[0, 50, 100, 150, 200, 230])
# Note: pd.cut intervals are right-closed by default, e.g. (0, 1], so values equal to 0
# fall outside the first bin and become NaN in the CCAvg, Experience and Mortgage buckets below
# Group CCAvg into CCAvgRange buckets by adding a new column to the dataframe
loan_data['CCAvgRange'] = pd.cut(x=loan_data['CCAvg'], bins=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Group Experience into ExperienceRange buckets by adding a new column to the dataframe
loan_data['ExperienceRange'] = pd.cut(x=loan_data['Experience'], bins=[0, 5, 10, 15, 20, 25, 30, 35, 40, 45])
# Group Mortgage into MortgageRange buckets by adding a new column to the dataframe
loan_data['MortgageRange'] = pd.cut(x=loan_data['Mortgage'], bins=[0, 90, 100, 200, 300, 400, 500, 600, 700])
print(loan_data.head())
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage \
0 1 25 1.0 49 91107 4 1.6 1 0
1 2 45 19.0 34 90089 3 1.5 1 0
2 3 39 15.0 11 94720 1 1.0 1 0
3 4 35 9.0 100 94112 1 2.7 2 0
4 5 35 8.0 45 91330 4 1.0 2 0
Personal_Loan Securities_Account CD_Account Online CreditCard \
0 0 1 0 0 0
1 0 1 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1
ZIPCode_County AgeRange IncomeRange CCAvgRange ExperienceRange \
0 Los Angeles County (20, 30] (0, 50] (1, 2] (0, 5]
1 Los Angeles County (40, 50] (0, 50] (1, 2] (15, 20]
2 Alameda County (30, 40] (0, 50] (0, 1] (10, 15]
3 San Francisco County (30, 40] (50, 100] (2, 3] (5, 10]
4 Los Angeles County (30, 40] (0, 50] (0, 1] (5, 10]
MortgageRange
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
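The NaN values in MortgageRange above come from how `pd.cut` builds its intervals. The following is a minimal sketch on made-up values (not rows from the dataset) that uses the same Mortgage bin edges; `include_lowest=True` is shown only as one possible way to keep zeros in the first bin, not as what this notebook does.

```python
import pandas as pd

# Toy values chosen for illustration only (not taken from loan_data)
values = pd.Series([0, 50, 101, 635])
edges = [0, 90, 100, 200, 300, 400, 500, 600, 700]

# Default behaviour: right-closed intervals (0, 90], (90, 100], ...
# so a value of exactly 0 is not covered by any interval and becomes NaN
default_bins = pd.cut(values, bins=edges)

# include_lowest=True closes the first interval on the left, so 0 is covered
closed_bins = pd.cut(values, bins=edges, include_lowest=True)

n_missing_default = int(default_bins.isna().sum())
n_missing_closed = int(closed_bins.isna().sum())
print(default_bins.tolist())
print(n_missing_default, n_missing_closed)
```

With the default settings, every customer with a Mortgage of 0 ends up without a MortgageRange, which is why the later counts for the binned columns are lower than 5000.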
# Let's replace the numeric codes in the Education column with names
loan_data['Education'] = loan_data['Education'].cat.rename_categories(
    {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
)
print(loan_data.head())
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage \
0 1 25 1.0 49 91107 4 1.6 Undergrad 0
1 2 45 19.0 34 90089 3 1.5 Undergrad 0
2 3 39 15.0 11 94720 1 1.0 Undergrad 0
3 4 35 9.0 100 94112 1 2.7 Graduate 0
4 5 35 8.0 45 91330 4 1.0 Graduate 0
Personal_Loan Securities_Account CD_Account Online CreditCard \
0 0 1 0 0 0
1 0 1 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1
ZIPCode_County AgeRange IncomeRange CCAvgRange ExperienceRange \
0 Los Angeles County (20, 30] (0, 50] (1, 2] (0, 5]
1 Los Angeles County (40, 50] (0, 50] (1, 2] (15, 20]
2 Alameda County (30, 40] (0, 50] (0, 1] (10, 15]
3 San Francisco County (30, 40] (50, 100] (2, 3] (5, 10]
4 Los Angeles County (30, 40] (0, 50] (0, 1] (5, 10]
MortgageRange
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
The Education column has been converted from numeric codes to category names:
1 = Undergrad, 2 = Graduate, 3 = Advanced/Professional
# Let's plot histograms of all numerical variables
all_col = loan_data.select_dtypes(include=np.number).columns.tolist()
all_col.remove('ID')
all_col.remove('Personal_Loan')
plt.figure(figsize=(17, 75))
for i in range(len(all_col)):
    plt.subplot(18, 3, i + 1)
    # plt.hist(loan_data[all_col[i]])  # alternative: plain histogram without a density curve
    sns.histplot(loan_data[all_col[i]], kde=True)  # histogram with a KDE curve
    plt.tight_layout()
    plt.title(all_col[i], fontsize=25)
plt.show()
# While doing univariate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that creates a boxplot and a histogram for any input numerical
# variable.
# This function takes the numerical column as the input and draws the boxplot
# and histogram for the variable.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (.25, .75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='violet')  # boxplot; a marker indicates the mean value of the column
    # histogram, with a fixed number of bins if one is given
    if bins:
        sns.histplot(feature, kde=False, ax=ax_hist2, bins=bins, color='orange')
    else:
        sns.histplot(feature, kde=False, ax=ax_hist2, color='tab:cyan')
    ax_hist2.axvline(np.mean(feature), color='purple', linestyle='--')  # add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # add median to the histogram
histogram_boxplot(loan_data.Age)
histogram_boxplot(loan_data.Experience)
histogram_boxplot(loan_data.Income)
loan_data['Income'].value_counts()
44 85 38 84 81 83 41 82 39 81 40 78 42 77 83 74 43 70 45 69 29 67 21 65 35 65 22 65 85 65 25 64 84 63 28 63 30 63 55 61 82 61 78 61 65 60 64 60 32 58 61 57 53 57 80 56 58 55 62 55 31 55 23 54 34 53 18 53 59 53 79 53 54 52 19 52 49 52 60 52 33 51 70 47 52 47 20 47 24 47 75 47 69 46 63 46 50 45 74 45 48 44 73 44 71 43 51 41 72 41 90 38 91 37 93 37 68 35 113 34 89 34 15 33 13 32 14 31 12 30 114 30 92 29 98 28 115 27 11 27 94 26 9 26 112 26 88 26 95 25 141 24 101 24 99 24 128 24 122 24 125 23 129 23 145 23 8 23 10 23 111 22 154 21 134 20 104 20 149 20 105 20 121 20 140 19 130 19 131 19 118 19 110 19 155 19 119 18 123 18 138 18 135 18 180 18 103 18 158 18 132 18 109 18 120 17 179 17 102 16 108 16 139 16 161 16 195 15 152 15 133 15 142 15 191 13 173 13 182 13 164 13 184 12 170 12 124 12 160 12 183 12 175 12 190 11 172 11 150 11 165 11 148 11 153 11 100 10 162 10 188 10 178 10 163 9 143 9 185 9 174 9 171 9 181 8 194 8 168 8 144 7 169 7 159 7 193 6 192 6 201 5 151 4 200 3 198 3 204 3 199 3 203 2 189 2 202 2 205 2 224 1 218 1 Name: Income, dtype: int64
histogram_boxplot(loan_data.CCAvg)
loan_data['CCAvg'].value_counts()
0.30 241 1.00 231 0.20 204 2.00 188 0.80 187 0.10 183 0.40 179 1.50 178 0.70 169 0.50 163 1.70 158 1.80 152 1.40 136 2.20 130 1.30 128 0.60 118 2.80 110 2.50 107 0.90 106 0.00 106 1.90 106 1.60 101 2.10 100 2.40 92 2.60 87 1.10 84 1.20 66 2.70 58 2.30 58 2.90 54 3.00 53 3.30 45 3.80 43 3.40 39 2.67 36 4.00 33 4.50 29 3.90 27 3.60 27 4.30 26 6.00 26 3.70 25 4.70 24 3.20 22 4.10 22 4.90 22 3.10 20 6.50 18 5.00 18 5.40 18 0.67 18 2.33 18 1.67 18 4.40 17 5.20 16 3.50 15 6.90 14 7.00 14 6.10 14 4.60 14 7.20 13 5.70 13 7.40 13 6.30 13 7.50 12 8.00 12 4.20 11 6.33 10 6.80 10 8.10 10 7.30 10 0.75 9 1.75 9 6.67 9 4.33 9 7.60 9 6.70 9 1.33 9 8.80 9 7.80 9 8.60 8 4.80 7 5.60 7 5.10 6 5.90 5 7.90 4 5.30 4 6.60 4 5.50 4 5.80 3 10.00 3 6.40 3 4.75 2 8.50 2 4.25 2 8.30 2 5.67 2 6.20 2 9.00 2 3.33 1 8.90 1 4.67 1 3.25 1 2.75 1 8.20 1 9.30 1 3.67 1 5.33 1 Name: CCAvg, dtype: int64
histogram_boxplot(loan_data.Mortgage)
# Function to create barplots that indicate the percentage for each category.
def perc_on_bar(z):
    '''
    Plot a countplot annotated with the percentage of each category
    z: name of a categorical column in loan_data
    The function won't work if a column is passed in the hue parameter
    '''
    total = len(loan_data[z])  # length of the column
    plt.figure(figsize=(15, 5))
    # plt.xticks(rotation=45)
    ax = sns.countplot(x=loan_data[z], palette='Paired')
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        # x = p.get_x() + p.get_width() / 2 - 0.05  # alternative x position of the label
        # y = p.get_y() + p.get_height()  # alternative y position of the label
        x = p.get_x() + p.get_width() / total + 0.2  # x position of the label
        y = p.get_y() + p.get_height()  # y position of the label (top of the bar)
        ax.annotate(percentage, (x, y), size=10)  # annotate the percentage
    plt.xticks(rotation=90)
    plt.show()  # show the plot
perc_on_bar('Personal_Loan')
perc_on_bar('ZIPCode_County')
perc_on_bar('IncomeRange')
perc_on_bar('AgeRange')
perc_on_bar('ExperienceRange')
perc_on_bar('MortgageRange')
perc_on_bar('Education')
perc_on_bar('Securities_Account')
perc_on_bar('CD_Account')
perc_on_bar('Online')
perc_on_bar('CreditCard')
perc_on_bar('CCAvgRange')
# Plot the heat map to check the correlation between numeric variables
numeric_columns = loan_data.select_dtypes(include=np.number).columns.tolist()
corr = (
    loan_data[numeric_columns].corr().sort_values(by=['Personal_Loan'], ascending=False)
)  # sorting correlations w.r.t. Personal_Loan
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(28, 15))
# Draw the heatmap
sns.heatmap(
    corr,
    cmap="seismic",
    annot=True,
    fmt=".1f",
    vmin=-1,
    vmax=1,
    center=0,
    square=False,
    linewidths=0.7,
    cbar_kws={"shrink": 0.5},
)
# Pairplot for all the variables
sns.pairplot(data, hue = 'Personal_Loan')
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    # crosstab of the feature against the target, with totals
    tab1 = pd.crosstab(x, loan_data['Personal_Loan'], margins=True)
    print(tab1)
    print('-' * 100)
    # visualising the crosstab as row-normalised stacked bars
    tab = pd.crosstab(x, loan_data['Personal_Loan'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(17, 7))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the axes
    plt.show()
stacked_plot(loan_data['Family'])
Personal_Loan     0    1   All
Family
1              1365  107  1472
2              1190  106  1296
3               877  133  1010
4              1088  134  1222
All            4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['Education'])
Personal_Loan             0    1   All
Education
Undergrad              2003   93  2096
Graduate               1221  182  1403
Advanced/Professional  1296  205  1501
All                    4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['AgeRange'])
Personal_Loan     0    1   All
AgeRange
(20, 30]        558   66   624
(30, 40]       1118  118  1236
(40, 50]       1148  122  1270
(50, 60]       1208  115  1323
(60, 70]        488   59   547
All            4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['ExperienceRange'])
Personal_Loan       0    1   All
ExperienceRange
(0, 5]            513   57   570
(5, 10]           555   69   624
(10, 15]          530   51   581
(15, 20]          605   67   672
(20, 25]          595   59   654
(25, 30]          587   60   647
(30, 35]          587   56   643
(35, 40]          413   47   460
(40, 45]           47    7    54
All              4432  473  4905
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['IncomeRange'])
Personal_Loan     0    1   All
IncomeRange
(0, 50]        1914    0  1914
(50, 100]      1832   42  1874
(100, 150]      550  220   770
(150, 200]      211  215   426
(200, 230]       13    3    16
All            4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['CCAvgRange'])
Personal_Loan     0    1   All
CCAvgRange
(0, 1]         1760   48  1808
(1, 2]         1286   47  1333
(2, 3]          833   71   904
(3, 4]          207   92   299
(4, 5]          121   83   204
(5, 6]           43   62   105
(6, 7]           77   43   120
(7, 8]           63   19    82
(8, 9]           25   10    35
(9, 10]           0    4     4
All            4415  479  4894
----------------------------------------------------------------------------------------------------
Customers with credit card spending of 9,000-10,000 USD a month have the highest loan uptake (all 4 of them bought the loan). However, with only 4 customers in this bin, they may be capped during outlier detection/treatment.
Customers with credit card spending of 5,000-6,000 USD a month have the next highest uptake (62 of 105, about 59%), followed by the 4,000-5,000 USD category (83 of 204, about 41%).
Customers whose credit card spending is greater than 3,000 USD a month have a higher chance of buying the personal loan.
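The comparison between bins is easier to follow as a loan uptake rate per bin. The sketch below recomputes those rates from the counts printed in the CCAvgRange crosstab above; the DataFrame `counts` and the column `loan_rate_pct` are names introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the CCAvgRange crosstab output (Personal_Loan = 0 and = 1)
counts = pd.DataFrame(
    {
        "no_loan": [1760, 1286, 833, 207, 121, 43, 77, 63, 25, 0],
        "loan": [48, 47, 71, 92, 83, 62, 43, 19, 10, 4],
    },
    index=["(0, 1]", "(1, 2]", "(2, 3]", "(3, 4]", "(4, 5]",
           "(5, 6]", "(6, 7]", "(7, 8]", "(8, 9]", "(9, 10]"],
)

# Share of customers in each bin who bought the loan, in percent
counts["loan_rate_pct"] = (
    100 * counts["loan"] / (counts["loan"] + counts["no_loan"])
).round(1)

print(counts.sort_values("loan_rate_pct", ascending=False))
```

Every bin above 3 has an uptake rate well above the bins at or below 3, which is the pattern the observations describe.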
stacked_plot(loan_data['MortgageRange'])
Personal_Loan     0    1   All
MortgageRange
(0, 90]         171    8   179
(90, 100]        98    5   103
(100, 200]      719   39   758
(200, 300]      257   40   297
(300, 400]       88   40   128
(400, 500]       28   20    48
(500, 600]        7   14    21
(600, 700]        2    2     4
All            1370  168  1538
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['Securities_Account'])
Personal_Loan          0    1   All
Securities_Account
0                   4058  420  4478
1                    462   60   522
All                 4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['CD_Account'])
Personal_Loan     0    1   All
CD_Account
0              4358  340  4698
1               162  140   302
All            4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['Online'])
Personal_Loan     0    1   All
Online
0              1827  189  2016
1              2693  291  2984
All            4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['CreditCard'])
Personal_Loan     0    1   All
CreditCard
0              3193  337  3530
1              1327  143  1470
All            4520  480  5000
----------------------------------------------------------------------------------------------------
stacked_plot(loan_data['ZIPCode_County'])
Personal_Loan               0    1   All
ZIPCode_County
Alameda County            456   44   500
Butte County               17    2    19
Contra Costa County        73   12    85
El Dorado County           16    1    17
Fresno County              24    2    26
Humboldt County            30    2    32
Imperial County             3    0     3
Kern County                47    7    54
Lake County                 4    0     4
Los Angeles County        984  111  1095
Marin County               48    6    54
Mendocino County            7    1     8
Merced County               4    0     4
Monterey County           113   15   128
Napa County                 3    0     3
Orange County             309   30   339
Placer County              22    2    24
Riverside County           50    6    56
Sacramento County         169   15   184
San Benito County          14    0    14
San Bernardino County      98    3   101
San Diego County          509   59   568
San Francisco County      238   19   257
San Joaquin County         12    1    13
San Luis Obispo County     28    5    33
San Mateo County          192   12   204
Santa Barbara County      143   11   154
Santa Clara County        492   71   563
Santa Cruz County          60    8    68
Shasta County              15    3    18
Siskiyou County             7    0     7
Solano County              30    3    33
Sonoma County              22    6    28
Stanislaus County          14    1    15
Trinity County              4    0     4
Tuolumne County             7    0     7
Unknown                    31    3    34
Ventura County            103   11   114
Yolo County               122    8   130
All                      4520  480  5000
----------------------------------------------------------------------------------------------------
sns.barplot(x = 'Education' , y = 'Income', hue = 'Personal_Loan',data = loan_data )
In the Advanced/Professional Education category, customers who took a loan have a mean Income of about 150k, while customers who did not take a loan have a mean Income of about 55k.
In the Graduate Education category, customers who took a loan have a mean Income of about 145k, while customers who did not take a loan have a mean Income of about 55k.
In the Undergraduate Education category, customers who took a loan have a mean Income of about 130k, while customers who did not take a loan have a mean Income of about 80k.
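The bar heights in this plot are group means, so the same figures can be read off a table instead of the chart. The sketch below shows the idea on a small made-up frame (the rows in `toy` are hypothetical, not taken from `loan_data`); on the real data the same `groupby` call could be applied to `loan_data` directly.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame(
    {
        "Education": ["Undergrad", "Undergrad", "Graduate",
                      "Graduate", "Graduate", "Undergrad"],
        "Personal_Loan": [0, 1, 0, 1, 1, 0],
        "Income": [80, 130, 50, 140, 150, 70],
    }
)

# Mean Income for each (Education, Personal_Loan) pair,
# reshaped so that the loan flag becomes the columns
mean_income = (
    toy.groupby(["Education", "Personal_Loan"])["Income"].mean().unstack()
)
print(mean_income)
```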
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)
sns.boxplot(x = 'ZIPCode_County' , y = 'Income',data = loan_data )
plt.xticks(rotation = 90)
loan_data.drop(['ID', 'ZIPCode'],axis=1,inplace=True)
loan_data.head()
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_County | AgeRange | IncomeRange | CCAvgRange | ExperienceRange | MortgageRange |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1.0 | 49 | 4 | 1.6 | Undergrad | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County | (20, 30] | (0, 50] | (1, 2] | (0, 5] | NaN |
| 1 | 45 | 19.0 | 34 | 3 | 1.5 | Undergrad | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County | (40, 50] | (0, 50] | (1, 2] | (15, 20] | NaN |
| 2 | 39 | 15.0 | 11 | 1 | 1.0 | Undergrad | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County | (30, 40] | (0, 50] | (0, 1] | (10, 15] | NaN |
| 3 | 35 | 9.0 | 100 | 1 | 2.7 | Graduate | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco County | (30, 40] | (50, 100] | (2, 3] | (5, 10] | NaN |
| 4 | 35 | 8.0 | 45 | 4 | 1.0 | Graduate | 0 | 0 | 0 | 0 | 0 | 1 | Los Angeles County | (30, 40] | (0, 50] | (0, 1] | (5, 10] | NaN |
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   float64
 2   Income              5000 non-null   int64
 3   Family              5000 non-null   category
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   category
 6   Mortgage            5000 non-null   int64
 7   Personal_Loan       5000 non-null   int64
 8   Securities_Account  5000 non-null   category
 9   CD_Account          5000 non-null   category
 10  Online              5000 non-null   category
 11  CreditCard          5000 non-null   category
 12  ZIPCode_County      5000 non-null   category
 13  AgeRange            5000 non-null   category
 14  IncomeRange         5000 non-null   category
 15  CCAvgRange          4894 non-null   category
 16  ExperienceRange     4905 non-null   category
 17  MortgageRange       1538 non-null   category
dtypes: category(12), float64(2), int64(4)
memory usage: 297.6 KB
numerical_col = loan_data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(loan_data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Let's treat the outliers by capping them at the boxplot whiskers and check again.
def treat_outliers(data, col):
    '''
    Treats outliers in a variable by capping them at the whiskers
    data: data frame
    col: str, name of the numerical variable
    '''
    Q1 = data[col].quantile(0.25)  # 25th quantile
    Q3 = data[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # and all the values above Upper_Whisker will be assigned the value of Upper_Whisker
    data[col] = np.clip(data[col], Lower_Whisker, Upper_Whisker)
    return data
def treat_outliers_all(data, col_list):
    '''
    Treats outliers in all numerical variables
    col_list: list of numerical variables
    data: data frame
    '''
    for c in col_list:
        data = treat_outliers(data, c)
    return data
numerical_col = loan_data.select_dtypes(include=np.number).columns.tolist()# getting list of numerical columns
numerical_col.remove('Personal_Loan')
print(numerical_col)
loan_data = treat_outliers_all(loan_data,numerical_col)
['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
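To make the effect of the capping concrete, here is a minimal sketch of the same IQR rule applied to a toy series. The values are made up for illustration and are not from the dataset; the variable names are likewise introduced only for this example.

```python
import pandas as pd

# Toy values with one obvious outlier (illustration only)
s = pd.Series([1, 2, 3, 4, 100])

q1 = s.quantile(0.25)  # 25th percentile
q3 = s.quantile(0.75)  # 75th percentile
iqr = q3 - q1

lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Values outside the whiskers are pulled back to the nearest whisker
capped = s.clip(lower=lower_whisker, upper=upper_whisker)

print(lower_whisker, upper_whisker)
print(capped.tolist())
```

Only the extreme value moves; observations inside the whiskers are left unchanged, so the row count stays the same.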
numerical_col = loan_data.select_dtypes(include=np.number).columns.tolist()
numerical_col.remove('Personal_Loan')
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(loan_data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
loan_data['Personal_Loan'].value_counts()
0    4520
1     480
Name: Personal_Loan, dtype: int64
loan_data1 = loan_data.copy()
# Drop the binned helper columns that are not needed for the model
loan_data1.drop(columns=['ExperienceRange','IncomeRange','AgeRange','MortgageRange','CCAvgRange'],inplace = True)
loan_data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   float64
 2   Income              5000 non-null   float64
 3   Family              5000 non-null   category
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   category
 6   Mortgage            5000 non-null   float64
 7   Personal_Loan       5000 non-null   int64
 8   Securities_Account  5000 non-null   category
 9   CD_Account          5000 non-null   category
 10  Online              5000 non-null   category
 11  CreditCard          5000 non-null   category
 12  ZIPCode_County      5000 non-null   category
dtypes: category(7), float64(4), int64(2)
memory usage: 270.9 KB
def split(*drop_cols):
    '''
    Function to split data into X and Y, then one-hot encode the X variables.
    Returns training and test sets
    *drop_cols : names of the columns to remove from the dataset before forming X (e.g. the target)
    '''
    X = loan_data1.drop(list(drop_cols), axis=1)
    Y = loan_data1[['Personal_Loan']]
    X = pd.get_dummies(X, drop_first=True)
    X = add_constant(X)
    # Splitting data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
    return X_train, X_test, y_train, y_test
print(loan_data1['Personal_Loan'].value_counts())
0    4520
1     480
Name: Personal_Loan, dtype: int64
X_train,X_test, y_train, y_test = split('Personal_Loan')
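The encoding step inside `split` can be illustrated on a small made-up frame. The sketch below shows what `pd.get_dummies(..., drop_first=True)` does to a categorical column; the rows in `toy` are hypothetical. It also passes `stratify=y` to `train_test_split`, which is an assumption added here and is not used in the notebook's `split` function; it is one way to keep the share of loan buyers similar in both sets when the target is imbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows for illustration only
toy = pd.DataFrame(
    {
        "Income": [49, 34, 11, 100, 45, 29, 72, 22, 81, 180],
        "Education": pd.Categorical(
            ["Undergrad", "Undergrad", "Undergrad", "Graduate", "Graduate",
             "Graduate", "Advanced", "Advanced", "Advanced", "Graduate"]
        ),
        "Personal_Loan": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    }
)

# One-hot encode the categorical column; drop_first=True drops the first
# category ("Advanced") so it acts as the reference level
X = pd.get_dummies(toy.drop(columns="Personal_Loan"), drop_first=True)
y = toy["Personal_Loan"]

# stratify=y (an assumption, not in the original split) keeps the class
# proportions equal across the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y
)

print(list(X.columns))
print(int(y_train.sum()), int(y_test.sum()))
```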
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

def get_metrics_score(model,library,train,test,train_y,test_y,threshold=0.5,flag=True,roc=False):
    '''
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    library: 'stats' for a statsmodels model or 'sklearn' for a scikit-learn model
    train, test: Independent features
    train_y, test_y: Dependent variable
    threshold: threshold for classifying the observation as 1 (used only when library='stats')
    flag: If True, the metric scores are printed. The default value is True.
    roc: If True, the ROC-AUC scores are also printed. The default value is False.
    '''
    if library == 'stats':
        # statsmodels returns probabilities, so apply the threshold to get class labels
        pred_train = (model.predict(train) > threshold).astype(int)
        pred_test = (model.predict(test) > threshold).astype(int)
    elif library == 'sklearn':
        # sklearn's predict already returns class labels
        pred_train = model.predict(train)
        pred_test = model.predict(test)
    else:
        raise ValueError("library must be 'stats' or 'sklearn'")

    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)
    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)
    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)
    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)
    # list with train and test scores
    score_list = [train_acc, test_acc, train_recall, test_recall,
                  train_precision, test_precision, train_f1, test_f1]

    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 on training set : ", train_f1)
        print("F1 on test set : ", test_f1)
    if roc:
        # Note: computed on the thresholded labels, not on predicted probabilities
        print("ROC-AUC Score on training set : ", roc_auc_score(train_y, pred_train))
        print("ROC-AUC Score on test set : ", roc_auc_score(test_y, pred_test))
    return score_list  # returning the list with train and test scores
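To make the role of `threshold` concrete, here is a small sketch in plain Python. The probabilities and labels are invented for illustration (they are not from the bank data), and `classify` and `recall` are hypothetical helpers rather than part of this notebook. It suggests how lowering the cut-off can raise recall, which matters when missing a likely borrower is costly.

```python
def classify(probabilities, threshold=0.5):
    """Label an observation 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def recall(y_true, y_pred):
    """Share of actual positives that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos

# Made-up example values, for illustration only
probs = [0.05, 0.20, 0.35, 0.45, 0.60, 0.90]
actual = [0, 0, 1, 1, 1, 1]

recall_default = recall(actual, classify(probs, threshold=0.5))
recall_lowered = recall(actual, classify(probs, threshold=0.3))
print(recall_default, recall_lowered)
```

The trade-off is that a lower threshold usually admits more false positives, so precision should be checked at the same time.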
from sklearn import metrics

def make_confusion_matrix(model,library,test_X,y_actual,threshold=0.5,labels=[0, 1]):
    '''
    model : classifier to predict values of X
    library: 'stats' for a statsmodels model or 'sklearn' for a scikit-learn model
    test_X: test set
    y_actual : ground truth
    threshold: threshold for classifying the observation as 1 (used only when library='stats')
    labels: class labels, in the order they should appear in the matrix
    '''
    if library == 'sklearn':
        y_predict = model.predict(test_X)
    elif library == 'stats':
        # statsmodels returns probabilities, so apply the threshold first
        y_predict = (model.predict(test_X) > threshold).astype(int)
    else:
        raise ValueError("library must be 'stats' or 'sklearn'")

    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Actual No", "Actual Yes"],
                         columns=["Predicted No", "Predicted Yes"])
    # each cell shows the count and its share of all observations
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
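from sklearn.linear_model import LogisticRegression

# There are different solvers available in sklearn's logistic regression;
# newton-cg is the one used here
print(X_train.shape)
print(y_train.head())
print(y_train.value_counts())
# fit_intercept=False because split() already added a constant column to X
lr = LogisticRegression(solver='newton-cg',random_state=42,fit_intercept=False)
# y_train is a one-column DataFrame; ravel() passes it as the 1-D array sklearn expects
model = lr.fit(X_train,y_train.values.ravel())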
# confusion matrix
make_confusion_matrix(lr,'sklearn',X_test,y_test)
# Let's check model performances for this model
scores_LR = get_metrics_score(model,'sklearn',X_train,X_test,y_train,y_test)
(3500, 53)
Personal_Loan
1334 0
4768 0
65 0
177 0
4489 0
Personal_Loan
0 3169
1 331
dtype: int64
Accuracy on training set : 0.9651428571428572
Accuracy on test set : 0.9526666666666667
Recall on training set : 0.7099697885196374
Recall on test set : 0.6040268456375839
Precision on training set : 0.9003831417624522
Precision on test set : 0.8823529411764706
F1 on training set : 0.793918918918919
F1 on test set : 0.7171314741035856
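The test-set scores above are consistent with a confusion matrix of 90 true positives, 12 false positives, 59 false negatives and 1339 true negatives. Those counts are inferred here from the printed scores (with 1500 test rows, 149 of them positive) rather than read off the heatmap, so treat them as an assumption. The sketch shows how the four metrics follow from the counts.

```python
# Counts inferred from the printed test-set scores (an assumption, see above)
tp, fp, fn, tn = 90, 12, 59, 1339

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)       # of the actual borrowers, how many were found
precision = tp / (tp + fp)    # of the predicted borrowers, how many were right
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy  : {accuracy:.4f}")
print(f"Recall    : {recall:.4f}")
print(f"Precision : {precision:.4f}")
print(f"F1        : {f1:.4f}")
```

A recall of about 0.60 means roughly 40% of the actual borrowers in the test set are missed at the default cut-off, even though accuracy is above 95%.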
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(warn_convergence =False)
print(X_train.head())
# Let's check model performances for this model
scores_LR = get_metrics_score(lg,'stats',X_train,X_test,y_train,y_test)
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.094952
Iterations: 35
const Age Experience Income CCAvg Mortgage Family_2 Family_3 \
1334 1.0 47 22.0 35.0 1.3 0.0 1 0
4768 1.0 38 14.0 39.0 2.0 0.0 0 0
65 1.0 59 35.0 131.0 3.8 0.0 0 0
177 1.0 29 3.0 65.0 1.8 244.0 0 0
4489 1.0 39 13.0 21.0 0.2 0.0 0 1
Family_4 Education_Graduate Education_Advanced/Professional \
1334 0 0 0
4768 0 1 0
65 0 0 0
177 1 1 0
4489 0 1 0
Securities_Account_1 CD_Account_1 Online_1 CreditCard_1 \
1334 0 0 1 0
4768 0 0 1 0
65 0 0 1 1
177 0 0 0 0
4489 0 0 1 0
ZIPCode_County_Butte County ZIPCode_County_Contra Costa County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_El Dorado County ZIPCode_County_Fresno County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Humboldt County ZIPCode_County_Imperial County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 1 0
ZIPCode_County_Kern County ZIPCode_County_Lake County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Los Angeles County ZIPCode_County_Marin County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Mendocino County ZIPCode_County_Merced County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Monterey County ZIPCode_County_Napa County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Orange County ZIPCode_County_Placer County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Riverside County ZIPCode_County_Sacramento County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_San Benito County ZIPCode_County_San Bernardino County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_San Diego County ZIPCode_County_San Francisco County \
1334 0 0
4768 0 0
65 0 0
177 0 1
4489 0 0
ZIPCode_County_San Joaquin County \
1334 0
4768 0
65 0
177 0
4489 0
ZIPCode_County_San Luis Obispo County ZIPCode_County_San Mateo County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Santa Barbara County ZIPCode_County_Santa Clara County \
1334 0 1
4768 1 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Santa Cruz County ZIPCode_County_Shasta County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Siskiyou County ZIPCode_County_Solano County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Sonoma County ZIPCode_County_Stanislaus County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Trinity County ZIPCode_County_Tuolumne County \
1334 0 0
4768 0 0
65 0 0
177 0 0
4489 0 0
ZIPCode_County_Unknown ZIPCode_County_Ventura County \
1334 0 0
4768 0 0
65 0 1
177 0 0
4489 0 0
ZIPCode_County_Yolo County
1334 0
4768 0
65 0
177 0
4489 0
Accuracy on training set : 0.9691428571428572
Accuracy on test set : 0.9573333333333334
Recall on training set : 0.7522658610271903
Recall on test set : 0.6510067114093959
Precision on training set : 0.9054545454545454
Precision on test set : 0.8899082568807339
F1 on training set : 0.8217821782178217
F1 on test set : 0.751937984496124
lg.summary()
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3447 |
| Method: | MLE | Df Model: | 52 |
| Date: | Sat, 17 Jul 2021 | Pseudo R-squ.: | 0.6966 |
| Time: | 05:20:15 | Log-Likelihood: | -332.33 |
| converged: | False | LL-Null: | -1095.5 |
| Covariance Type: | nonrobust | LLR p-value: | 2.923e-285 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -15.3189 | 2.425 | -6.317 | 0.000 | -20.072 | -10.566 |
| Age | -0.0071 | 0.087 | -0.082 | 0.935 | -0.177 | 0.163 |
| Experience | 0.0156 | 0.087 | 0.180 | 0.857 | -0.154 | 0.185 |
| Income | 0.0683 | 0.004 | 15.474 | 0.000 | 0.060 | 0.077 |
| CCAvg | 0.5492 | 0.085 | 6.476 | 0.000 | 0.383 | 0.715 |
| Mortgage | 0.0014 | 0.001 | 1.351 | 0.177 | -0.001 | 0.004 |
| Family_2 | 0.1266 | 0.310 | 0.409 | 0.683 | -0.481 | 0.734 |
| Family_3 | 2.9368 | 0.357 | 8.235 | 0.000 | 2.238 | 3.636 |
| Family_4 | 1.9064 | 0.340 | 5.599 | 0.000 | 1.239 | 2.574 |
| Education_Graduate | 4.3257 | 0.378 | 11.453 | 0.000 | 3.585 | 5.066 |
| Education_Advanced/Professional | 4.5974 | 0.379 | 12.117 | 0.000 | 3.854 | 5.341 |
| Securities_Account_1 | -1.0263 | 0.440 | -2.334 | 0.020 | -1.888 | -0.164 |
| CD_Account_1 | 3.8575 | 0.481 | 8.012 | 0.000 | 2.914 | 4.801 |
| Online_1 | -0.6704 | 0.223 | -3.008 | 0.003 | -1.107 | -0.234 |
| CreditCard_1 | -1.1348 | 0.299 | -3.793 | 0.000 | -1.721 | -0.548 |
| ZIPCode_County_Butte County | -21.2200 | 1.4e+05 | -0.000 | 1.000 | -2.75e+05 | 2.75e+05 |
| ZIPCode_County_Contra Costa County | 0.3337 | 0.917 | 0.364 | 0.716 | -1.464 | 2.132 |
| ZIPCode_County_El Dorado County | -0.4459 | 1.674 | -0.266 | 0.790 | -3.727 | 2.835 |
| ZIPCode_County_Fresno County | -0.5812 | 2.201 | -0.264 | 0.792 | -4.896 | 3.733 |
| ZIPCode_County_Humboldt County | -1.0884 | 1.968 | -0.553 | 0.580 | -4.945 | 2.769 |
| ZIPCode_County_Imperial County | -14.1192 | 2.63e+04 | -0.001 | 1.000 | -5.15e+04 | 5.14e+04 |
| ZIPCode_County_Kern County | 1.6597 | 0.834 | 1.989 | 0.047 | 0.024 | 3.295 |
| ZIPCode_County_Lake County | -11.7725 | 4588.339 | -0.003 | 0.998 | -9004.752 | 8981.207 |
| ZIPCode_County_Los Angeles County | 0.2347 | 0.405 | 0.580 | 0.562 | -0.559 | 1.028 |
| ZIPCode_County_Marin County | 0.6335 | 0.951 | 0.666 | 0.505 | -1.230 | 2.497 |
| ZIPCode_County_Mendocino County | -2.5238 | 5.897 | -0.428 | 0.669 | -14.082 | 9.035 |
| ZIPCode_County_Merced County | -19.7340 | 7.47e+04 | -0.000 | 1.000 | -1.46e+05 | 1.46e+05 |
| ZIPCode_County_Monterey County | -0.1177 | 0.752 | -0.157 | 0.876 | -1.592 | 1.357 |
| ZIPCode_County_Napa County | -19.1340 | 3.4e+05 | -5.62e-05 | 1.000 | -6.67e+05 | 6.67e+05 |
| ZIPCode_County_Orange County | 0.2290 | 0.528 | 0.434 | 0.665 | -0.806 | 1.264 |
| ZIPCode_County_Placer County | 1.3373 | 1.089 | 1.228 | 0.219 | -0.797 | 3.472 |
| ZIPCode_County_Riverside County | 2.5941 | 0.873 | 2.971 | 0.003 | 0.883 | 4.305 |
| ZIPCode_County_Sacramento County | 0.4258 | 0.632 | 0.673 | 0.501 | -0.813 | 1.665 |
| ZIPCode_County_San Benito County | -7.7159 | 156.256 | -0.049 | 0.961 | -313.973 | 298.541 |
| ZIPCode_County_San Bernardino County | -0.8796 | 1.139 | -0.772 | 0.440 | -3.112 | 1.352 |
| ZIPCode_County_San Diego County | 0.2071 | 0.462 | 0.448 | 0.654 | -0.699 | 1.113 |
| ZIPCode_County_San Francisco County | 0.4421 | 0.569 | 0.778 | 0.437 | -0.672 | 1.556 |
| ZIPCode_County_San Joaquin County | -0.2464 | 10.357 | -0.024 | 0.981 | -20.545 | 20.052 |
| ZIPCode_County_San Luis Obispo County | -1.7462 | 2.378 | -0.734 | 0.463 | -6.407 | 2.915 |
| ZIPCode_County_San Mateo County | -1.1198 | 0.695 | -1.612 | 0.107 | -2.482 | 0.242 |
| ZIPCode_County_Santa Barbara County | 0.7105 | 0.664 | 1.070 | 0.285 | -0.591 | 2.012 |
| ZIPCode_County_Santa Clara County | 0.4167 | 0.456 | 0.913 | 0.361 | -0.478 | 1.311 |
| ZIPCode_County_Santa Cruz County | -0.0865 | 0.943 | -0.092 | 0.927 | -1.936 | 1.763 |
| ZIPCode_County_Shasta County | -4.4148 | 11.896 | -0.371 | 0.711 | -27.731 | 18.902 |
| ZIPCode_County_Siskiyou County | -55.4115 | 6.54e+12 | -8.48e-12 | 1.000 | -1.28e+13 | 1.28e+13 |
| ZIPCode_County_Solano County | 1.0235 | 1.168 | 0.877 | 0.381 | -1.265 | 3.312 |
| ZIPCode_County_Sonoma County | 1.5505 | 1.242 | 1.249 | 0.212 | -0.883 | 3.984 |
| ZIPCode_County_Stanislaus County | -18.4095 | 1.09e+04 | -0.002 | 0.999 | -2.15e+04 | 2.14e+04 |
| ZIPCode_County_Trinity County | -22.4167 | 2.53e+05 | -8.86e-05 | 1.000 | -4.96e+05 | 4.96e+05 |
| ZIPCode_County_Tuolumne County | -20.5327 | 1.39e+05 | -0.000 | 1.000 | -2.73e+05 | 2.73e+05 |
| ZIPCode_County_Unknown | 0.7067 | 1.185 | 0.596 | 0.551 | -1.616 | 3.029 |
| ZIPCode_County_Ventura County | 0.1459 | 0.701 | 0.208 | 0.835 | -1.228 | 1.519 |
| ZIPCode_County_Yolo County | -0.3196 | 0.789 | -0.405 | 0.685 | -1.866 | 1.227 |
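Coefficients near -20 with standard errors in the thousands or more, as for several county dummies above, are a typical symptom of (quasi-)separation: a category in which the outcome never varies, so the likelihood keeps improving as the coefficient drifts toward infinity and the fit does not converge. The sketch below shows one way to look for such categories. It uses plain Python and a tiny invented sample; `find_single_outcome_categories` is a hypothetical helper, and whether this is what happens for each flagged county would need to be checked against `loan_data1`.

```python
from collections import defaultdict

def find_single_outcome_categories(categories, outcomes):
    """Return the categories whose observations all share one outcome value."""
    seen = defaultdict(set)
    for category, outcome in zip(categories, outcomes):
        seen[category].add(outcome)
    return sorted(c for c, values in seen.items() if len(values) == 1)

# Invented sample, for illustration only
county = ["A", "A", "B", "B", "B", "C"]
loan = [0, 1, 0, 0, 0, 1]

flagged = find_single_outcome_categories(county, loan)
print(flagged)
```

In the notebook itself, a comparable check could be a `pd.crosstab` of `ZIPCode_County` against `Personal_Loan`, looking for rows that contain a zero.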
# changing datatype of columns to numeric for checking VIF
X_train_num = X_train.astype(float).copy()
vif_series1 = pd.Series([variance_inflation_factor(X_train_num.values,i) for i in range(X_train_num.shape[1])],index=X_train_num.columns, dtype = float)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
Series before feature selection:

const                                   480.470335
Age                                      96.529501
Experience                               96.428494
Income                                    1.857393
CCAvg                                     1.728553
Mortgage                                  1.032745
Family_2                                  1.420346
Family_3                                  1.399994
Family_4                                  1.442593
Education_Graduate                        1.317949
Education_Advanced/Professional           1.344078
Securities_Account_1                      1.160256
CD_Account_1                              1.376347
Online_1                                  1.055947
CreditCard_1                              1.126163
ZIPCode_County_Butte County               1.036645
ZIPCode_County_Contra Costa County        1.142392
ZIPCode_County_El Dorado County           1.030441
ZIPCode_County_Fresno County              1.034913
ZIPCode_County_Humboldt County            1.064358
ZIPCode_County_Imperial County            1.007932
ZIPCode_County_Kern County                1.095017
ZIPCode_County_Lake County                1.014240
ZIPCode_County_Los Angeles County         2.430448
ZIPCode_County_Marin County               1.093496
ZIPCode_County_Mendocino County           1.024039
ZIPCode_County_Merced County              1.010611
ZIPCode_County_Monterey County            1.218278
ZIPCode_County_Napa County                1.011166
ZIPCode_County_Orange County              1.561951
ZIPCode_County_Placer County              1.046807
ZIPCode_County_Riverside County           1.089932
ZIPCode_County_Sacramento County          1.327414
ZIPCode_County_San Benito County          1.030028
ZIPCode_County_San Bernardino County      1.182119
ZIPCode_County_San Diego County           1.822014
ZIPCode_County_San Francisco County       1.439141
ZIPCode_County_San Joaquin County         1.016504
ZIPCode_County_San Luis Obispo County     1.053845
ZIPCode_County_San Mateo County           1.342552
ZIPCode_County_Santa Barbara County       1.246172
ZIPCode_County_Santa Clara County         1.840647
ZIPCode_County_Santa Cruz County          1.127744
ZIPCode_County_Shasta County              1.022106
ZIPCode_County_Siskiyou County            1.015515
ZIPCode_County_Solano County              1.068509
ZIPCode_County_Sonoma County              1.061841
ZIPCode_County_Stanislaus County          1.027770
ZIPCode_County_Trinity County             1.010708
ZIPCode_County_Tuolumne County            1.012719
ZIPCode_County_Unknown                    1.067991
ZIPCode_County_Ventura County             1.194736
ZIPCode_County_Yolo County                1.210263
dtype: float64
data.corr()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1.000000 | -0.008473 | -0.008326 | -0.017695 | 0.002240 | -0.016797 | -0.024675 | 0.021463 | -0.013920 | -0.024801 | -0.016972 | -0.006909 | -0.002528 | 0.017028 |
| Age | -0.008473 | 1.000000 | 0.994215 | -0.055269 | -0.030530 | -0.046418 | -0.052012 | 0.041334 | -0.012539 | -0.007726 | -0.000436 | 0.008043 | 0.013702 | 0.007681 |
| Experience | -0.008326 | 0.994215 | 1.000000 | -0.046574 | -0.030456 | -0.052563 | -0.050077 | 0.013152 | -0.010582 | -0.007413 | -0.001232 | 0.010353 | 0.013898 | 0.008967 |
| Income | -0.017695 | -0.055269 | -0.046574 | 1.000000 | -0.030709 | -0.157501 | 0.645984 | -0.187524 | 0.206806 | 0.502462 | -0.002616 | 0.169738 | 0.014206 | -0.002385 |
| ZIPCode | 0.002240 | -0.030530 | -0.030456 | -0.030709 | 1.000000 | 0.027512 | -0.012188 | -0.008266 | 0.003614 | -0.002974 | 0.002422 | 0.021671 | 0.028317 | 0.024033 |
| Family | -0.016797 | -0.046418 | -0.052563 | -0.157501 | 0.027512 | 1.000000 | -0.109275 | 0.064929 | -0.020445 | 0.061367 | 0.019994 | 0.014110 | 0.010354 | 0.011588 |
| CCAvg | -0.024675 | -0.052012 | -0.050077 | 0.645984 | -0.012188 | -0.109275 | 1.000000 | -0.136124 | 0.109905 | 0.366889 | 0.015086 | 0.136534 | -0.003611 | -0.006689 |
| Education | 0.021463 | 0.041334 | 0.013152 | -0.187524 | -0.008266 | 0.064929 | -0.136124 | 1.000000 | -0.033327 | 0.136722 | -0.010812 | 0.013934 | -0.015004 | -0.011014 |
| Mortgage | -0.013920 | -0.012539 | -0.010582 | 0.206806 | 0.003614 | -0.020445 | 0.109905 | -0.033327 | 1.000000 | 0.142095 | -0.005411 | 0.089311 | -0.005995 | -0.007231 |
| Personal_Loan | -0.024801 | -0.007726 | -0.007413 | 0.502462 | -0.002974 | 0.061367 | 0.366889 | 0.136722 | 0.142095 | 1.000000 | 0.021954 | 0.316355 | 0.006278 | 0.002802 |
| Securities_Account | -0.016972 | -0.000436 | -0.001232 | -0.002616 | 0.002422 | 0.019994 | 0.015086 | -0.010812 | -0.005411 | 0.021954 | 1.000000 | 0.317034 | 0.012627 | -0.015028 |
| CD_Account | -0.006909 | 0.008043 | 0.010353 | 0.169738 | 0.021671 | 0.014110 | 0.136534 | 0.013934 | 0.089311 | 0.316355 | 0.317034 | 1.000000 | 0.175880 | 0.278644 |
| Online | -0.002528 | 0.013702 | 0.013898 | 0.014206 | 0.028317 | 0.010354 | -0.003611 | -0.015004 | -0.005995 | 0.006278 | 0.012627 | 0.175880 | 1.000000 | 0.004210 |
| CreditCard | 0.017028 | 0.007681 | 0.008967 | -0.002385 | 0.024033 | 0.011588 | -0.006689 | -0.011014 | -0.007231 | 0.002802 | -0.015028 | 0.278644 | 0.004210 | 1.000000 |
loan_data1.corr()
| | Age | Experience | Income | CCAvg | Mortgage | Personal_Loan |
|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.994214 | -0.054988 | -0.052032 | -0.012033 | -0.007726 |
| Experience | 0.994214 | 1.000000 | -0.046579 | -0.050633 | -0.010910 | -0.008060 |
| Income | -0.054988 | -0.046579 | 1.000000 | 0.637869 | 0.135018 | 0.504559 |
| CCAvg | -0.052032 | -0.050633 | 0.637869 | 1.000000 | 0.068329 | 0.383306 |
| Mortgage | -0.012033 | -0.010910 | 0.135018 | 0.068329 | 1.000000 | 0.092989 |
| Personal_Loan | -0.007726 | -0.008060 | 0.504559 | 0.383306 | 0.092989 | 1.000000 |
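The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. As a rough check on the Age and Experience values of about 96, the sketch below plugs in only their pairwise correlation from the table above, so it ignores every other predictor and understates the reported figure somewhat. The helper name `vif_from_r` is illustrative.

```python
def vif_from_r(r):
    """VIF when a predictor is explained by one other predictor with correlation r."""
    r_squared = r ** 2
    return 1.0 / (1.0 - r_squared)

# Age/Experience correlation copied from loan_data1.corr()
age_experience_r = 0.994214

approx_vif = vif_from_r(age_experience_r)
print(f"Approximate VIF from the pairwise correlation: {approx_vif:.1f}")
```

The pairwise correlation alone already accounts for most of the inflation, which is why dropping either Age or Experience is enough to bring the other one's VIF close to 1.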
X_train_num1 = X_train_num.drop('Age',axis=1)
vif_series2 = pd.Series([variance_inflation_factor(X_train_num1.values,i) for i in range(X_train_num1.shape[1])],index=X_train_num1.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series2))
Series before feature selection:

const                                    23.982852
Experience                                1.025286
Income                                    1.851851
CCAvg                                     1.722141
Mortgage                                  1.032711
Family_2                                  1.420092
Family_3                                  1.396478
Family_4                                  1.442591
Education_Graduate                        1.304460
Education_Advanced/Professional           1.264031
Securities_Account_1                      1.159648
CD_Account_1                              1.375225
Online_1                                  1.055780
CreditCard_1                              1.126150
ZIPCode_County_Butte County               1.036377
ZIPCode_County_Contra Costa County        1.142107
ZIPCode_County_El Dorado County           1.030323
ZIPCode_County_Fresno County              1.034913
ZIPCode_County_Humboldt County            1.063657
ZIPCode_County_Imperial County            1.007259
ZIPCode_County_Kern County                1.094919
ZIPCode_County_Lake County                1.013983
ZIPCode_County_Los Angeles County         2.430049
ZIPCode_County_Marin County               1.092781
ZIPCode_County_Mendocino County           1.024024
ZIPCode_County_Merced County              1.010300
ZIPCode_County_Monterey County            1.218278
ZIPCode_County_Napa County                1.011066
ZIPCode_County_Orange County              1.559526
ZIPCode_County_Placer County              1.046721
ZIPCode_County_Riverside County           1.089858
ZIPCode_County_Sacramento County          1.327360
ZIPCode_County_San Benito County          1.030028
ZIPCode_County_San Bernardino County      1.182021
ZIPCode_County_San Diego County           1.820682
ZIPCode_County_San Francisco County       1.439105
ZIPCode_County_San Joaquin County         1.016468
ZIPCode_County_San Luis Obispo County     1.052913
ZIPCode_County_San Mateo County           1.342465
ZIPCode_County_Santa Barbara County       1.245878
ZIPCode_County_Santa Clara County         1.840491
ZIPCode_County_Santa Cruz County          1.126971
ZIPCode_County_Shasta County              1.022048
ZIPCode_County_Siskiyou County            1.014996
ZIPCode_County_Solano County              1.068492
ZIPCode_County_Sonoma County              1.061838
ZIPCode_County_Stanislaus County          1.027770
ZIPCode_County_Trinity County             1.010515
ZIPCode_County_Tuolumne County            1.012560
ZIPCode_County_Unknown                    1.067665
ZIPCode_County_Ventura County             1.194649
ZIPCode_County_Yolo County                1.210233
dtype: float64
X_train1,X_test1,y_train,y_test = split('Personal_Loan','Age','Mortgage')
X_train_num2 = X_train_num1.drop('Mortgage',axis=1)
vif_series3 = pd.Series([variance_inflation_factor(X_train_num2.values,i) for i in range(X_train_num2.shape[1])],index=X_train_num2.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series3))
Series before feature selection:

const                                    23.796695
Experience                                1.025170
Income                                    1.833279
CCAvg                                     1.720845
Family_2                                  1.419437
Family_3                                  1.395628
Family_4                                  1.442341
Education_Graduate                        1.303756
Education_Advanced/Professional           1.263993
Securities_Account_1                      1.159252
CD_Account_1                              1.370972
Online_1                                  1.055616
CreditCard_1                              1.125725
ZIPCode_County_Butte County               1.036235
ZIPCode_County_Contra Costa County        1.142053
ZIPCode_County_El Dorado County           1.030185
ZIPCode_County_Fresno County              1.034767
ZIPCode_County_Humboldt County            1.063298
ZIPCode_County_Imperial County            1.006753
ZIPCode_County_Kern County                1.094910
ZIPCode_County_Lake County                1.013424
ZIPCode_County_Los Angeles County         2.429873
ZIPCode_County_Marin County               1.092737
ZIPCode_County_Mendocino County           1.023767
ZIPCode_County_Merced County              1.009934
ZIPCode_County_Monterey County            1.217994
ZIPCode_County_Napa County                1.010585
ZIPCode_County_Orange County              1.559525
ZIPCode_County_Placer County              1.046184
ZIPCode_County_Riverside County           1.089845
ZIPCode_County_Sacramento County          1.326676
ZIPCode_County_San Benito County          1.029624
ZIPCode_County_San Bernardino County      1.181529
ZIPCode_County_San Diego County           1.820681
ZIPCode_County_San Francisco County       1.438897
ZIPCode_County_San Joaquin County         1.016463
ZIPCode_County_San Luis Obispo County     1.052040
ZIPCode_County_San Mateo County           1.341892
ZIPCode_County_Santa Barbara County       1.244913
ZIPCode_County_Santa Clara County         1.840444
ZIPCode_County_Santa Cruz County          1.125783
ZIPCode_County_Shasta County              1.022040
ZIPCode_County_Siskiyou County            1.014690
ZIPCode_County_Solano County              1.068408
ZIPCode_County_Sonoma County              1.061616
ZIPCode_County_Stanislaus County          1.026329
ZIPCode_County_Trinity County             1.010163
ZIPCode_County_Tuolumne County            1.011624
ZIPCode_County_Unknown                    1.067597
ZIPCode_County_Ventura County             1.194543
ZIPCode_County_Yolo County                1.210207
dtype: float64
print(X_test.head())
const Age Experience Income CCAvg Mortgage Family_2 Family_3 \
2764 1.0 31 5.0 84.0 2.9 105.0 0 0
4767 1.0 35 9.0 45.0 0.9 101.0 0 1
3814 1.0 34 9.0 35.0 1.3 0.0 0 1
3499 1.0 49 23.0 114.0 0.3 252.5 0 0
2735 1.0 36 12.0 70.0 2.6 165.0 0 1
Family_4 Education_Graduate Education_Advanced/Professional \
2764 0 0 1
4767 0 0 0
3814 0 0 0
3499 0 0 0
2735 0 1 0
Securities_Account_1 CD_Account_1 Online_1 CreditCard_1 \
2764 0 0 0 1
4767 1 0 0 0
3814 0 0 0 0
3499 0 0 1 0
2735 0 0 1 0
ZIPCode_County_Butte County ZIPCode_County_Contra Costa County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_El Dorado County ZIPCode_County_Fresno County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Humboldt County ZIPCode_County_Imperial County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Kern County ZIPCode_County_Lake County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Los Angeles County ZIPCode_County_Marin County \
2764 0 0
4767 1 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Mendocino County ZIPCode_County_Merced County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Monterey County ZIPCode_County_Napa County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Orange County ZIPCode_County_Placer County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Riverside County ZIPCode_County_Sacramento County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_San Benito County ZIPCode_County_San Bernardino County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_San Diego County ZIPCode_County_San Francisco County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 1 0
ZIPCode_County_San Joaquin County \
2764 0
4767 0
3814 0
3499 0
2735 0
ZIPCode_County_San Luis Obispo County ZIPCode_County_San Mateo County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Santa Barbara County ZIPCode_County_Santa Clara County \
2764 0 0
4767 0 0
3814 0 1
3499 0 0
2735 0 0
ZIPCode_County_Santa Cruz County ZIPCode_County_Shasta County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Siskiyou County ZIPCode_County_Solano County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Sonoma County ZIPCode_County_Stanislaus County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Trinity County ZIPCode_County_Tuolumne County \
2764 0 0
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Unknown ZIPCode_County_Ventura County \
2764 0 1
4767 0 0
3814 0 0
3499 0 0
2735 0 0
ZIPCode_County_Yolo County
2764 0
4767 0
3814 0
3499 0
2735 0
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(warn_convergence =False)
# Let's check model performances for this model
scores_LR = get_metrics_score(lg1,'stats',X_train1,X_test1,y_train,y_test)
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.095209
Iterations: 35
Accuracy on training set : 0.9691428571428572
Accuracy on test set : 0.9566666666666667
Recall on training set : 0.7522658610271903
Recall on test set : 0.6510067114093959
Precision on training set : 0.9054545454545454
Precision on test set : 0.8818181818181818
F1 on training set : 0.8217821782178217
F1 on test set : 0.749034749034749
lg1.summary()
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3449 |
| Method: | MLE | Df Model: | 50 |
| Date: | Sat, 17 Jul 2021 | Pseudo R-squ.: | 0.6958 |
| Time: | 05:20:18 | Log-Likelihood: | -333.23 |
| converged: | False | LL-Null: | -1095.5 |
| Covariance Type: | nonrobust | LLR p-value: | 2.290e-286 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -15.4126 | 0.972 | -15.856 | 0.000 | -17.318 | -13.507 |
| Experience | 0.0077 | 0.009 | 0.836 | 0.403 | -0.010 | 0.026 |
| Income | 0.0687 | 0.004 | 15.633 | 0.000 | 0.060 | 0.077 |
| CCAvg | 0.5381 | 0.084 | 6.398 | 0.000 | 0.373 | 0.703 |
| Family_2 | 0.1379 | 0.309 | 0.446 | 0.656 | -0.468 | 0.744 |
| Family_3 | 2.9563 | 0.357 | 8.287 | 0.000 | 2.257 | 3.656 |
| Family_4 | 1.9124 | 0.341 | 5.614 | 0.000 | 1.245 | 2.580 |
| Education_Graduate | 4.2874 | 0.375 | 11.436 | 0.000 | 3.553 | 5.022 |
| Education_Advanced/Professional | 4.5660 | 0.375 | 12.181 | 0.000 | 3.831 | 5.301 |
| Securities_Account_1 | -1.0083 | 0.440 | -2.292 | 0.022 | -1.871 | -0.146 |
| CD_Account_1 | 3.8589 | 0.482 | 8.013 | 0.000 | 2.915 | 4.803 |
| Online_1 | -0.6614 | 0.222 | -2.977 | 0.003 | -1.097 | -0.226 |
| CreditCard_1 | -1.1214 | 0.297 | -3.773 | 0.000 | -1.704 | -0.539 |
| ZIPCode_County_Butte County | -21.1777 | 1.36e+05 | -0.000 | 1.000 | -2.67e+05 | 2.67e+05 |
| ZIPCode_County_Contra Costa County | 0.3559 | 0.920 | 0.387 | 0.699 | -1.447 | 2.159 |
| ZIPCode_County_El Dorado County | -0.4467 | 1.720 | -0.260 | 0.795 | -3.818 | 2.924 |
| ZIPCode_County_Fresno County | -0.6593 | 2.200 | -0.300 | 0.764 | -4.970 | 3.652 |
| ZIPCode_County_Humboldt County | -1.1563 | 1.961 | -0.590 | 0.555 | -5.000 | 2.687 |
| ZIPCode_County_Imperial County | -13.9330 | 2.47e+04 | -0.001 | 1.000 | -4.84e+04 | 4.84e+04 |
| ZIPCode_County_Kern County | 1.6913 | 0.840 | 2.013 | 0.044 | 0.045 | 3.338 |
| ZIPCode_County_Lake County | -13.4409 | 1.16e+04 | -0.001 | 0.999 | -2.27e+04 | 2.26e+04 |
| ZIPCode_County_Los Angeles County | 0.2310 | 0.404 | 0.571 | 0.568 | -0.562 | 1.024 |
| ZIPCode_County_Marin County | 0.6001 | 0.943 | 0.637 | 0.524 | -1.247 | 2.447 |
| ZIPCode_County_Mendocino County | -2.4034 | 5.408 | -0.444 | 0.657 | -13.002 | 8.195 |
| ZIPCode_County_Merced County | -13.4295 | 2957.103 | -0.005 | 0.996 | -5809.245 | 5782.386 |
| ZIPCode_County_Monterey County | -0.0531 | 0.749 | -0.071 | 0.943 | -1.521 | 1.415 |
| ZIPCode_County_Napa County | -7.1410 | 859.641 | -0.008 | 0.993 | -1692.007 | 1677.725 |
| ZIPCode_County_Orange County | 0.2392 | 0.527 | 0.454 | 0.650 | -0.795 | 1.273 |
| ZIPCode_County_Placer County | 1.2601 | 1.082 | 1.164 | 0.244 | -0.861 | 3.381 |
| ZIPCode_County_Riverside County | 2.5560 | 0.875 | 2.922 | 0.003 | 0.841 | 4.271 |
| ZIPCode_County_Sacramento County | 0.4527 | 0.633 | 0.715 | 0.474 | -0.787 | 1.693 |
| ZIPCode_County_San Benito County | -14.7128 | 5063.679 | -0.003 | 0.998 | -9939.341 | 9909.916 |
| ZIPCode_County_San Bernardino County | -0.8294 | 1.131 | -0.733 | 0.463 | -3.047 | 1.388 |
| ZIPCode_County_San Diego County | 0.2313 | 0.461 | 0.502 | 0.616 | -0.672 | 1.135 |
| ZIPCode_County_San Francisco County | 0.4979 | 0.567 | 0.878 | 0.380 | -0.614 | 1.609 |
| ZIPCode_County_San Joaquin County | -0.2615 | 10.649 | -0.025 | 0.980 | -21.133 | 20.610 |
| ZIPCode_County_San Luis Obispo County | -1.8121 | 2.361 | -0.768 | 0.443 | -6.440 | 2.815 |
| ZIPCode_County_San Mateo County | -1.1274 | 0.698 | -1.615 | 0.106 | -2.495 | 0.240 |
| ZIPCode_County_Santa Barbara County | 0.7760 | 0.659 | 1.177 | 0.239 | -0.516 | 2.068 |
| ZIPCode_County_Santa Clara County | 0.4029 | 0.456 | 0.883 | 0.377 | -0.491 | 1.297 |
| ZIPCode_County_Santa Cruz County | 0.0227 | 0.924 | 0.025 | 0.980 | -1.788 | 1.833 |
| ZIPCode_County_Shasta County | -4.3023 | 10.931 | -0.394 | 0.694 | -25.727 | 17.123 |
| ZIPCode_County_Siskiyou County | -34.6562 | 2.02e+08 | -1.72e-07 | 1.000 | -3.96e+08 | 3.96e+08 |
| ZIPCode_County_Solano County | 1.0081 | 1.185 | 0.851 | 0.395 | -1.314 | 3.330 |
| ZIPCode_County_Sonoma County | 1.5833 | 1.249 | 1.267 | 0.205 | -0.865 | 4.032 |
| ZIPCode_County_Stanislaus County | -24.0800 | 1.91e+05 | -0.000 | 1.000 | -3.75e+05 | 3.75e+05 |
| ZIPCode_County_Trinity County | -37.5923 | 4.85e+08 | -7.75e-08 | 1.000 | -9.5e+08 | 9.5e+08 |
| ZIPCode_County_Tuolumne County | -21.4258 | 2.51e+05 | -8.54e-05 | 1.000 | -4.92e+05 | 4.92e+05 |
| ZIPCode_County_Unknown | 0.7671 | 1.160 | 0.661 | 0.508 | -1.507 | 3.041 |
| ZIPCode_County_Ventura County | 0.1525 | 0.699 | 0.218 | 0.827 | -1.217 | 1.522 |
| ZIPCode_County_Yolo County | -0.2650 | 0.781 | -0.339 | 0.734 | -1.796 | 1.266 |
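The "Pseudo R-squ." value in these summaries is McFadden's pseudo R-squared, 1 - LL / LL-Null. The sketch below recomputes it from the log-likelihoods printed in the two summaries so far; the function name `mcfadden_r2` is illustrative. Since neither fit converged, the figures are best read as indicative.

```python
def mcfadden_r2(log_likelihood, log_likelihood_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - log_likelihood / log_likelihood_null

ll_null = -1095.5                             # LL-Null, the same for both models
r2_full = mcfadden_r2(-332.33, ll_null)       # model with Age and Mortgage
r2_reduced = mcfadden_r2(-333.23, ll_null)    # model without Age and Mortgage

print(f"{r2_full:.4f} {r2_reduced:.4f}")
```

Dropping Age and Mortgage costs less than 0.001 of pseudo R-squared, which suggests the two variables add little to the fit.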
X_train2,X_test2, y_train, y_test = split('Personal_Loan','Age','Mortgage','ZIPCode_County')
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(warn_convergence =False)
# Let's check model performances for this model
scores_LR = get_metrics_score(lg2,'stats',X_train2,X_test2,y_train,y_test)
Optimization terminated successfully.
Current function value: 0.099152
Iterations 10
Accuracy on training set : 0.9682857142857143
Accuracy on test set : 0.9613333333333334
Recall on training set : 0.7462235649546828
Recall on test set : 0.6711409395973155
Precision on training set : 0.9014598540145985
Precision on test set : 0.9174311926605505
F1 on training set : 0.8165289256198347
F1 on test set : 0.7751937984496124
lg2.summary()
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3487 |
| Method: | MLE | Df Model: | 12 |
| Date: | Sat, 17 Jul 2021 | Pseudo R-squ.: | 0.6832 |
| Time: | 05:20:18 | Log-Likelihood: | -347.03 |
| converged: | True | LL-Null: | -1095.5 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -14.7984 | 0.851 | -17.397 | 0.000 | -16.466 | -13.131 |
| Experience | 0.0076 | 0.009 | 0.861 | 0.389 | -0.010 | 0.025 |
| Income | 0.0666 | 0.004 | 15.870 | 0.000 | 0.058 | 0.075 |
| CCAvg | 0.5242 | 0.080 | 6.587 | 0.000 | 0.368 | 0.680 |
| Family_2 | 0.0840 | 0.296 | 0.283 | 0.777 | -0.497 | 0.665 |
| Family_3 | 2.7706 | 0.337 | 8.215 | 0.000 | 2.110 | 3.432 |
| Family_4 | 1.7950 | 0.326 | 5.511 | 0.000 | 1.157 | 2.433 |
| Education_Graduate | 4.2129 | 0.363 | 11.619 | 0.000 | 3.502 | 4.924 |
| Education_Advanced/Professional | 4.4808 | 0.362 | 12.372 | 0.000 | 3.771 | 5.191 |
| Securities_Account_1 | -1.0026 | 0.424 | -2.366 | 0.018 | -1.833 | -0.172 |
| CD_Account_1 | 3.6498 | 0.458 | 7.974 | 0.000 | 2.753 | 4.547 |
| Online_1 | -0.5852 | 0.214 | -2.736 | 0.006 | -1.004 | -0.166 |
| CreditCard_1 | -0.9683 | 0.278 | -3.481 | 0.001 | -1.514 | -0.423 |
X_train3,X_test3, y_train, y_test = split('Personal_Loan','Experience','Mortgage','ZIPCode_County')
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit(warn_convergence =False)
# Let's check model performances for this model
scores_LR = get_metrics_score(lg3,'stats',X_train3,X_test3,y_train,y_test)
Optimization terminated successfully.
Current function value: 0.099154
Iterations 10
Accuracy on training set : 0.9682857142857143
Accuracy on test set : 0.9613333333333334
Recall on training set : 0.7462235649546828
Recall on test set : 0.6711409395973155
Precision on training set : 0.9014598540145985
Precision on test set : 0.9174311926605505
F1 on training set : 0.8165289256198347
F1 on test set : 0.7751937984496124
lg3.summary()
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3487 |
| Method: | MLE | Df Model: | 12 |
| Date: | Sat, 17 Jul 2021 | Pseudo R-squ.: | 0.6832 |
| Time: | 05:20:18 | Log-Likelihood: | -347.04 |
| converged: | True | LL-Null: | -1095.5 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -14.9901 | 0.939 | -15.956 | 0.000 | -16.831 | -13.149 |
| Age | 0.0075 | 0.009 | 0.853 | 0.394 | -0.010 | 0.025 |
| Income | 0.0666 | 0.004 | 15.867 | 0.000 | 0.058 | 0.075 |
| CCAvg | 0.5239 | 0.080 | 6.588 | 0.000 | 0.368 | 0.680 |
| Family_2 | 0.0839 | 0.296 | 0.283 | 0.777 | -0.497 | 0.665 |
| Family_3 | 2.7706 | 0.337 | 8.214 | 0.000 | 2.110 | 3.432 |
| Family_4 | 1.7954 | 0.326 | 5.512 | 0.000 | 1.157 | 2.434 |
| Education_Graduate | 4.2117 | 0.363 | 11.615 | 0.000 | 3.501 | 4.922 |
| Education_Advanced/Professional | 4.4769 | 0.362 | 12.366 | 0.000 | 3.767 | 5.186 |
| Securities_Account_1 | -1.0019 | 0.424 | -2.364 | 0.018 | -1.833 | -0.171 |
| CD_Account_1 | 3.6506 | 0.458 | 7.975 | 0.000 | 2.753 | 4.548 |
| Online_1 | -0.5850 | 0.214 | -2.735 | 0.006 | -1.004 | -0.166 |
| CreditCard_1 | -0.9678 | 0.278 | -3.479 | 0.001 | -1.513 | -0.423 |
X_train4,X_test4, y_train, y_test = split('Personal_Loan','Experience','Mortgage','ZIPCode_County','Age')
logit4 = sm.Logit(y_train, X_train4.astype(float))
lg4 = logit4.fit(warn_convergence =False)
# Let's check model performances for this model
scores_LR = get_metrics_score(lg4,'stats',X_train4,X_test4,y_train,y_test)
Optimization terminated successfully.
Current function value: 0.099258
Iterations 10
Accuracy on training set : 0.9691428571428572
Accuracy on test set : 0.9586666666666667
Recall on training set : 0.7371601208459214
Recall on test set : 0.6442953020134228
Precision on training set : 0.9207547169811321
Precision on test set : 0.9142857142857143
F1 on training set : 0.8187919463087249
F1 on test set : 0.7559055118110237
lg4.summary()
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3488 |
| Method: | MLE | Df Model: | 11 |
| Date: | Sat, 17 Jul 2021 | Pseudo R-squ.: | 0.6829 |
| Time: | 05:20:19 | Log-Likelihood: | -347.40 |
| converged: | True | LL-Null: | -1095.5 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -14.6094 | 0.817 | -17.882 | 0.000 | -16.211 | -13.008 |
| Income | 0.0664 | 0.004 | 15.869 | 0.000 | 0.058 | 0.075 |
| CCAvg | 0.5182 | 0.079 | 6.531 | 0.000 | 0.363 | 0.674 |
| Family_2 | 0.0830 | 0.296 | 0.280 | 0.779 | -0.497 | 0.663 |
| Family_3 | 2.7636 | 0.337 | 8.193 | 0.000 | 2.102 | 3.425 |
| Family_4 | 1.7845 | 0.325 | 5.483 | 0.000 | 1.147 | 2.422 |
| Education_Graduate | 4.2119 | 0.363 | 11.613 | 0.000 | 3.501 | 4.923 |
| Education_Advanced/Professional | 4.4714 | 0.362 | 12.359 | 0.000 | 3.762 | 5.180 |
| Securities_Account_1 | -1.0098 | 0.425 | -2.375 | 0.018 | -1.843 | -0.176 |
| CD_Account_1 | 3.6761 | 0.458 | 8.023 | 0.000 | 2.778 | 4.574 |
| Online_1 | -0.5891 | 0.214 | -2.757 | 0.006 | -1.008 | -0.170 |
| CreditCard_1 | -0.9725 | 0.278 | -3.497 | 0.000 | -1.518 | -0.427 |
X_train5,X_test5, y_train, y_test = split('Personal_Loan','Age','Mortgage','ZIPCode_County','Experience')
X_train5.drop('Family_2', axis = 1, inplace = True)
X_test5.drop('Family_2', axis = 1, inplace = True)
logit5 = sm.Logit(y_train, X_train5.astype(float))
lg5 = logit5.fit(warn_convergence =False)
# Let's check model performances for this model
scores_LR = get_metrics_score(lg5,'stats',X_train5,X_test5,y_train,y_test)
Optimization terminated successfully.
Current function value: 0.099269
Iterations 10
Accuracy on training set : 0.9691428571428572
Accuracy on test set : 0.96
Recall on training set : 0.7401812688821753
Recall on test set : 0.6577181208053692
Precision on training set : 0.9176029962546817
Precision on test set : 0.9158878504672897
F1 on training set : 0.8193979933110368
F1 on test set : 0.765625
lg5.summary()
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3489 |
| Method: | MLE | Df Model: | 10 |
| Date: | Sat, 17 Jul 2021 | Pseudo R-squ.: | 0.6828 |
| Time: | 05:20:19 | Log-Likelihood: | -347.44 |
| converged: | True | LL-Null: | -1095.5 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -14.5669 | 0.801 | -18.180 | 0.000 | -16.137 | -12.996 |
| Income | 0.0664 | 0.004 | 15.879 | 0.000 | 0.058 | 0.075 |
| CCAvg | 0.5200 | 0.079 | 6.571 | 0.000 | 0.365 | 0.675 |
| Family_3 | 2.7209 | 0.300 | 9.071 | 0.000 | 2.133 | 3.309 |
| Family_4 | 1.7414 | 0.286 | 6.093 | 0.000 | 1.181 | 2.302 |
| Education_Graduate | 4.2095 | 0.363 | 11.611 | 0.000 | 3.499 | 4.920 |
| Education_Advanced/Professional | 4.4698 | 0.362 | 12.358 | 0.000 | 3.761 | 5.179 |
| Securities_Account_1 | -1.0103 | 0.425 | -2.375 | 0.018 | -1.844 | -0.176 |
| CD_Account_1 | 3.6677 | 0.457 | 8.022 | 0.000 | 2.772 | 4.564 |
| Online_1 | -0.5874 | 0.214 | -2.750 | 0.006 | -1.006 | -0.169 |
| CreditCard_1 | -0.9696 | 0.278 | -3.490 | 0.000 | -1.514 | -0.425 |
#confusion matrix
make_confusion_matrix(lg5,'stats',X_test5,y_test)
# metrics
scores_LR = get_metrics_score(lg5,'stats',X_train5,X_test5,y_train,y_test)
Accuracy on training set : 0.9691428571428572
Accuracy on test set : 0.96
Recall on training set : 0.7401812688821753
Recall on test set : 0.6577181208053692
Precision on training set : 0.9176029962546817
Precision on test set : 0.9158878504672897
F1 on training set : 0.8193979933110368
F1 on test set : 0.765625
logit_roc_auc_train = roc_auc_score(y_train, lg5.predict(X_train5))
fpr, tpr, thresholds = roc_curve(y_train, lg5.predict(X_train5))
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc_train)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
logit_roc_auc_test = roc_auc_score(y_test, lg5.predict(X_test5))
fpr, tpr, thresholds = roc_curve(y_test, lg5.predict(X_test5))
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc_test)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
The coefficients of Income, CCAvg, Family_3, Family_4, both Education levels, and CD_Account are positive, so an increase in any of these raises the odds that a customer buys a Personal Loan.
The coefficients of Securities_Account, Online, and CreditCard are negative, so an increase in any of these lowers the odds that a customer buys a Personal Loan.
The coefficients of a logistic regression model are expressed as log-odds; to obtain the odds ratio, take the exponential of the coefficient.
Therefore, odds = exp(b)
The percentage change in odds is given by change_odds% = (exp(b) - 1) * 100
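As a quick check of the two formulas, the sketch below applies them to two coefficients copied (rounded) from the lg5 summary table. The helper names `odds_ratio` and `pct_change_in_odds` are illustrative and are not part of the notebook.

```python
import math

# Coefficients (log-odds) as printed, rounded, in the lg5 summary table.
coefs = {"Income": 0.0664, "Online_1": -0.5874}

def odds_ratio(b):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(b)

def pct_change_in_odds(b):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(b) - 1) * 100

for name, b in coefs.items():
    print(f"{name}: odds ratio = {odds_ratio(b):.4f}, "
          f"change in odds = {pct_change_in_odds(b):+.2f}%")
```

A positive coefficient gives an odds ratio above 1 (Income, roughly a 6.9% increase in odds per unit), while a negative coefficient gives a ratio below 1 (Online_1, roughly a 44% decrease).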
odds = np.exp(lg5.params) # converting coefficients to odds
pd.set_option('display.max_columns',None) # removing limit from number of columns to display
pd.DataFrame(odds, X_train5.columns, columns=['odds']) # adding the odds to a dataframe
| | odds |
|---|---|
| const | 4.717156e-07 |
| Income | 1.068642e+00 |
| CCAvg | 1.682016e+00 |
| Family_3 | 1.519469e+01 |
| Family_4 | 5.705258e+00 |
| Education_Graduate | 6.732155e+01 |
| Education_Advanced/Professional | 8.733981e+01 |
| Securities_Account_1 | 3.641269e-01 |
| CD_Account_1 | 3.916144e+01 |
| Online_1 | 5.557446e-01 |
| CreditCard_1 | 3.792363e-01 |
perc_change_odds = (np.exp(lg5.params)-1)*100 # finding the percentage change
pd.set_option('display.max_columns',None) # removing limit from number of columns to display
pd.DataFrame(perc_change_odds, X_train5.columns, columns=['change_odds%']).T # adding the change_odds% to a dataframe
| | const | Income | CCAvg | Family_3 | Family_4 | Education_Graduate | Education_Advanced/Professional | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| change_odds% | -99.999953 | 6.864234 | 68.201623 | 1419.468717 | 470.525813 | 6632.15484 | 8633.980901 | -63.587306 | 3816.143773 | -44.425539 | -62.076368 |
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = metrics.roc_curve(y_test, lg5.predict(X_test5))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.09953954146461887
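The rule `np.argmax(tpr - fpr)` picks the cut-off that maximises the gap between the true positive rate and the false positive rate (often called Youden's J statistic). The sketch below reproduces that rule in plain Python on a small hypothetical set of labels and scores, so it runs without the notebook's data. The function names and the toy data are assumptions made for illustration.

```python
def roc_points(y_true, scores):
    """Return (threshold, tpr, fpr) for every distinct score used as a cut-off."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points

def youden_threshold(y_true, scores):
    """Pick the threshold that maximises J = TPR - FPR."""
    best = max(roc_points(y_true, scores), key=lambda p: p[1] - p[2])
    return best[0]

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.12, 0.15, 0.30, 0.60, 0.90]

best = youden_threshold(y_true, scores)
print(best)
```

On imbalanced data this rule tends to select a low threshold, which matches the value of about 0.10 found above: it trades precision for recall.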
#confusion matrix with optimal threshold = 0.099
make_confusion_matrix(lg5,'stats',X_test5,y_test,threshold=optimal_threshold_auc_roc)
# checking model performance
scores_LR = get_metrics_score(lg5,'stats',X_train5,X_test5,y_train,y_test,threshold=optimal_threshold_auc_roc,roc=True)
Accuracy on training set : 0.918
Accuracy on test set : 0.92
Recall on training set : 0.8972809667673716
Recall on test set : 0.8590604026845637
Precision on training set : 0.54
Precision on test set : 0.5638766519823789
F1 on training set : 0.674233825198638
F1 on test set : 0.6808510638297872
ROC-AUC Score on training set : 0.9087225281927738
ROC-AUC Score on test set : 0.89289067506545
y_scores = lg5.predict(X_train5)
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    # precision_recall_curve returns one more precision/recall value than thresholds,
    # so the last point is dropped to align the arrays
    plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
    plt.plot(thresholds, recalls[:-1], 'g--', label='recall')
    plt.xlabel('Threshold')
    plt.legend(loc='upper left')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
plt.figure(figsize=(10,7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
optimal_threshold_curve = 0.35
#confusion matrix with optimal threshold = 0.35
make_confusion_matrix(lg5,'stats',X_test5,y_test,threshold=optimal_threshold_curve)
# checking model performance
scores_LR = get_metrics_score(lg5,'stats',X_train5,X_test5,y_train,y_test,threshold=optimal_threshold_curve,roc=True)
Accuracy on training set : 0.9628571428571429
Accuracy on test set : 0.958
Recall on training set : 0.7945619335347432
Recall on test set : 0.7248322147651006
Precision on training set : 0.8092307692307692
Precision on test set : 0.8307692307692308
F1 on training set : 0.801829268292683
F1 on test set : 0.7741935483870969
ROC-AUC Score on training set : 0.8874987010684129
ROC-AUC Score on test set : 0.8542739904321431
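To make the precision/recall trade-off behind these threshold choices concrete, here is a minimal plain-Python sketch that scores one set of predicted probabilities at two cut-offs. The data are hypothetical, `metrics_at_threshold` is an illustrative helper rather than the notebook's `get_metrics_score`, and the `>=` comparison is an assumption about how the cut-off is applied.

```python
def metrics_at_threshold(y_true, scores, threshold):
    """Return (precision, recall, f1) when scores >= threshold are predicted as 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.12, 0.15, 0.30, 0.60, 0.90]

for t in (0.5, 0.1):
    p, r, f = metrics_at_threshold(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}, f1={f:.2f}")
```

Lowering the threshold captures more true buyers (higher recall) at the cost of more false alarms (lower precision), which is the same pattern seen in the three sets of scores reported for lg5.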
# defining list of models
models = [lg5]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through the models list to get the metrics score - Accuracy, Recall, Precision, and F1 score
for model in models:
    j = get_metrics_score(model,'stats',X_train5,X_test5,y_train,y_test,flag=False)
    k = get_metrics_score(model,'stats',X_train5,X_test5,y_train,y_test,threshold=optimal_threshold_auc_roc,flag=False)
    l = get_metrics_score(model,'stats',X_train5,X_test5,y_train,y_test,threshold=optimal_threshold_curve,flag=False)
    # initial logistic regression model (default threshold)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
    f1_train.append(j[6])
    f1_test.append(j[7])
    # logistic regression with the ROC-curve threshold (optimal_threshold_auc_roc, about 0.0995)
    acc_train.append(k[0])
    acc_test.append(k[1])
    recall_train.append(k[2])
    recall_test.append(k[3])
    precision_train.append(k[4])
    precision_test.append(k[5])
    f1_train.append(k[6])
    f1_test.append(k[7])
    # logistic regression with threshold = 0.35
    acc_train.append(l[0])
    acc_test.append(l[1])
    recall_train.append(l[2])
    recall_test.append(l[3])
    precision_train.append(l[4])
    precision_test.append(l[5])
    f1_train.append(l[6])
    f1_test.append(l[7])
comparison_frame = pd.DataFrame({'Model':['Logistic Regression Model - Statsmodels',
                                          'Logistic Regression - Optimal threshold = 0.09',
                                          'Logistic Regression - Optimal threshold = 0.35'
                                          ],
                                 'Train_Accuracy':acc_train,
                                 'Test_Accuracy':acc_test,
                                 'Train Recall':recall_train,
                                 'Test Recall':recall_test,
                                 'Train Precision':precision_train,
                                 'Test Precision':precision_test,
                                 'Train F1':f1_train,
                                 'Test F1':f1_test
                                 })
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train Recall | Test Recall | Train Precision | Test Precision | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression Model - Statsmodels | 0.969143 | 0.960 | 0.740181 | 0.657718 | 0.917603 | 0.915888 | 0.819398 | 0.765625 |
| 1 | Logistic Regression - Optimal threshold = 0.09 | 0.918000 | 0.920 | 0.897281 | 0.859060 | 0.540000 | 0.563877 | 0.674234 | 0.680851 |
| 2 | Logistic Regression - Optimal threshold = 0.35 | 0.962857 | 0.958 | 0.794562 | 0.724832 | 0.809231 | 0.830769 | 0.801829 | 0.774194 |
We have built a predictive model that the bank can use to find the customers who will buy a Personal Loan, with an F1 score of 0.80 on the training set and 0.77 on the test set (Logistic Regression - Optimal threshold = 0.35, using only the significant predictors).
The coefficients of Income, CCAvg, Family_3, Family_4, both Education levels, and CD_Account are positive, so an increase in any of these raises the chances of a customer buying the loan.
The coefficients of Securities_Account, Online, and CreditCard are negative, so an increase in any of these lowers the chances of a customer buying a Personal Loan.
We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion for splitting.
If the frequency of class A is 10% and the frequency of class B is 90%, class B becomes the dominant class and the decision tree becomes biased toward it.
In this case, we can pass a dictionary {0:0.10,1:0.90} to the model to specify the weight of each class, so that the decision tree gives more weight to class 1 (the minority class of loan buyers).
class_weight is a hyperparameter of the decision tree classifier.
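One way to see what the class weights do is to compute the Gini impurity of a node from weighted class counts, which is broadly how the weights enter the split criterion (the `weights: [...]` entries in the tree's text report are such weighted counts). The sketch below is a plain-Python illustration on a hypothetical node of 90 non-buyers and 10 buyers. `weighted_gini` is an illustrative helper, not a scikit-learn function.

```python
def weighted_gini(counts, class_weight):
    """Gini impurity of a node, computed from class counts scaled by class weights."""
    weighted = {c: n * class_weight[c] for c, n in counts.items()}
    total = sum(weighted.values())
    return 1.0 - sum((w / total) ** 2 for w in weighted.values())

# Hypothetical node: 90 customers of class 0 and 10 customers of class 1.
counts = {0: 90, 1: 10}

unweighted = weighted_gini(counts, {0: 1.0, 1: 1.0})
reweighted = weighted_gini(counts, {0: 0.10, 1: 0.90})
print(round(unweighted, 4), round(reweighted, 4))
```

Without weights the node looks almost pure (impurity of about 0.18), so the tree has little incentive to split it further. With the weights {0:0.10,1:0.90} the same node is treated as evenly mixed (impurity of about 0.5), which pushes the tree to keep separating the minority class.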
X_train,X_test, y_train, y_test = split('Personal_Loan')
model_dt = DecisionTreeClassifier(criterion='gini',class_weight={0:0.10,1:0.90},random_state=1)
model_dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.1, 1: 0.9}, random_state=1)
def make_confusion_matrix_dt(model,y_actual,labels=[0,1]):
    '''
    model : fitted classifier used to predict on the global X_test
    y_actual : ground truth labels for X_test
    labels : class labels, in the order used for the matrix rows and columns
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm1 = pd.DataFrame(cm, index=['Actual No','Actual Yes'],
                          columns=['Predicted No','Predicted Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
    # each cell shows the count and its share of all test rows
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts,group_percentages)]
    annot = np.asarray(annot).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm1, annot=annot, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
make_confusion_matrix_dt(model_dt,y_test)
y_train.value_counts()
Personal_Loan
0                3169
1                 331
dtype: int64
True Positive:
Reality: A customer buys a loan.
Model predicted: The liability customer will be converted to a loan customer.
Outcome: The model is correct and the campaign reaches a real buyer.
True Negative:
Reality: A customer did NOT buy a loan.
Model predicted: The liability customer will NOT be converted to a loan customer.
Outcome: The business is unaffected.
False Positive:
Reality: A customer did NOT buy a loan.
Model predicted: The customer will be converted to a loan customer.
Outcome: The team targeting potential customers spends resources on a customer who does not buy, which is a smaller loss than missing a customer who would have bought a loan.
False Negative:
Reality: A customer buys a loan.
Model predicted: The customer will NOT buy a loan.
Outcome: The sales/marketing team misses a potential customer who could have been targeted with an offer, so a conversion opportunity is lost. This is the costlier error, which is why recall is the metric to maximise.
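The four outcomes above are the four cells of a confusion matrix: true positive, true negative, false positive, and false negative. The plain-Python sketch below tallies them and derives recall, the share of actual buyers that the model catches. The labels and predictions are hypothetical and the helper names are illustrative only.

```python
from collections import Counter

def confusion_cells(y_true, y_pred):
    """Count TP, TN, FP and FN from paired actual and predicted labels (1 = buys a loan)."""
    names = {(1, 1): "TP", (0, 0): "TN", (0, 1): "FP", (1, 0): "FN"}
    return Counter(names[(actual, pred)] for actual, pred in zip(y_true, y_pred))

def recall(cells):
    """Recall = TP / (TP + FN): the fraction of actual buyers the model identifies."""
    tp, fn = cells["TP"], cells["FN"]
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical actual outcomes and model predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cells = confusion_cells(y_true, y_pred)
print(dict(cells), recall(cells))
```

Because a missed buyer (false negative) costs more than a wasted contact (false positive), recall is the natural score to optimise for the decision tree models.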
## Function to calculate recall score
def get_recall_score(model):
    '''
    model : fitted classifier used to predict on the global X_train and X_test
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
    print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
get_recall_score(model_dt)
Recall on training set : 1.0
Recall on test set : 0.8523489932885906
column_names = list(X_train.columns)# Keep only names of features by removing the name of target variable
feature_names = column_names
print(feature_names)
['const', 'Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Family_2', 'Family_3', 'Family_4', 'Education_Graduate', 'Education_Advanced/Professional', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1', 'ZIPCode_County_Butte County', 'ZIPCode_County_Contra Costa County', 'ZIPCode_County_El Dorado County', 'ZIPCode_County_Fresno County', 'ZIPCode_County_Humboldt County', 'ZIPCode_County_Imperial County', 'ZIPCode_County_Kern County', 'ZIPCode_County_Lake County', 'ZIPCode_County_Los Angeles County', 'ZIPCode_County_Marin County', 'ZIPCode_County_Mendocino County', 'ZIPCode_County_Merced County', 'ZIPCode_County_Monterey County', 'ZIPCode_County_Napa County', 'ZIPCode_County_Orange County', 'ZIPCode_County_Placer County', 'ZIPCode_County_Riverside County', 'ZIPCode_County_Sacramento County', 'ZIPCode_County_San Benito County', 'ZIPCode_County_San Bernardino County', 'ZIPCode_County_San Diego County', 'ZIPCode_County_San Francisco County', 'ZIPCode_County_San Joaquin County', 'ZIPCode_County_San Luis Obispo County', 'ZIPCode_County_San Mateo County', 'ZIPCode_County_Santa Barbara County', 'ZIPCode_County_Santa Clara County', 'ZIPCode_County_Santa Cruz County', 'ZIPCode_County_Shasta County', 'ZIPCode_County_Siskiyou County', 'ZIPCode_County_Solano County', 'ZIPCode_County_Sonoma County', 'ZIPCode_County_Stanislaus County', 'ZIPCode_County_Trinity County', 'ZIPCode_County_Tuolumne County', 'ZIPCode_County_Unknown', 'ZIPCode_County_Ventura County', 'ZIPCode_County_Yolo County']
plt.figure(figsize=(20,30))
out = tree.plot_tree(model_dt,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None,)
# the code below adds arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model_dt,feature_names=feature_names,show_weights=True))
|--- Income <= 92.50
| |--- CCAvg <= 2.95
| | |--- ZIPCode_County_Solano County <= 0.50
| | | |--- weights: [241.50, 0.00] class: 0
| | |--- ZIPCode_County_Solano County > 0.50
| | | |--- weights: [2.00, 0.00] class: 0
| |--- CCAvg > 2.95
| | |--- CD_Account_1 <= 0.50
| | | |--- CCAvg <= 3.95
| | | | |--- Mortgage <= 102.50
| | | | | |--- CCAvg <= 3.05
| | | | | | |--- weights: [1.50, 0.00] class: 0
| | | | | |--- CCAvg > 3.05
| | | | | | |--- Family_4 <= 0.50
| | | | | | | |--- Income <= 67.00
| | | | | | | | |--- weights: [0.80, 0.00] class: 0
| | | | | | | |--- Income > 67.00
| | | | | | | | |--- Securities_Account_1 <= 0.50
| | | | | | | | | |--- ZIPCode_County_Contra Costa County <= 0.50
| | | | | | | | | | |--- Income <= 84.00
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | | | |--- Income > 84.00
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- ZIPCode_County_Contra Costa County > 0.50
| | | | | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | | | |--- Securities_Account_1 > 0.50
| | | | | | | | | |--- weights: [0.40, 0.00] class: 0
| | | | | | |--- Family_4 > 0.50
| | | | | | | |--- weights: [1.30, 0.00] class: 0
| | | | |--- Mortgage > 102.50
| | | | | |--- Education_Graduate <= 0.50
| | | | | | |--- weights: [1.90, 0.00] class: 0
| | | | | |--- Education_Graduate > 0.50
| | | | | | |--- weights: [0.20, 0.00] class: 0
| | | |--- CCAvg > 3.95
| | | | |--- Age <= 29.50
| | | | | |--- weights: [0.10, 0.00] class: 0
| | | | |--- Age > 29.50
| | | | | |--- weights: [4.10, 0.00] class: 0
| | |--- CD_Account_1 > 0.50
| | | |--- weights: [0.00, 4.50] class: 1
|--- Income > 92.50
| |--- Education_Advanced/Professional <= 0.50
| | |--- Education_Graduate <= 0.50
| | | |--- Family_3 <= 0.50
| | | | |--- Family_4 <= 0.50
| | | | | |--- Income <= 103.50
| | | | | | |--- CCAvg <= 3.21
| | | | | | | |--- weights: [4.00, 0.00] class: 0
| | | | | | |--- CCAvg > 3.21
| | | | | | | |--- ZIPCode_County_Los Angeles County <= 0.50
| | | | | | | | |--- weights: [0.40, 0.00] class: 0
| | | | | | | |--- ZIPCode_County_Los Angeles County > 0.50
| | | | | | | | |--- Income <= 96.00
| | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | | | |--- Income > 96.00
| | | | | | | | | |--- weights: [0.00, 2.70] class: 1
| | | | | |--- Income > 103.50
| | | | | | |--- weights: [43.30, 0.00] class: 0
| | | | |--- Family_4 > 0.50
| | | | | |--- Income <= 93.50
| | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | |--- Income > 93.50
| | | | | | |--- Income <= 102.00
| | | | | | | |--- CCAvg <= 4.50
| | | | | | | | |--- weights: [0.00, 0.90] class: 1
| | | | | | | |--- CCAvg > 4.50
| | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | |--- Income > 102.00
| | | | | | | |--- weights: [0.00, 17.10] class: 1
| | | |--- Family_3 > 0.50
| | | | |--- Income <= 108.50
| | | | | |--- weights: [1.10, 0.00] class: 0
| | | | |--- Income > 108.50
| | | | | |--- Age <= 26.00
| | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | |--- Age > 26.00
| | | | | | |--- Income <= 118.00
| | | | | | | |--- Online_1 <= 0.50
| | | | | | | | |--- weights: [0.00, 1.80] class: 1
| | | | | | | |--- Online_1 > 0.50
| | | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | |--- Income > 118.00
| | | | | | | |--- weights: [0.00, 29.70] class: 1
| | |--- Education_Graduate > 0.50
| | | |--- Income <= 110.50
| | | | |--- CCAvg <= 2.90
| | | | | |--- ZIPCode_County_San Francisco County <= 0.50
| | | | | | |--- weights: [4.30, 0.00] class: 0
| | | | | |--- ZIPCode_County_San Francisco County > 0.50
| | | | | | |--- Online_1 <= 0.50
| | | | | | | |--- CCAvg <= 1.40
| | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | | |--- CCAvg > 1.40
| | | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | |--- Online_1 > 0.50
| | | | | | | |--- Age <= 57.00
| | | | | | | | |--- weights: [0.00, 0.90] class: 1
| | | | | | | |--- Age > 57.00
| | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | |--- CCAvg > 2.90
| | | | | |--- Age <= 55.00
| | | | | | |--- ZIPCode_County_San Diego County <= 0.50
| | | | | | | |--- weights: [0.00, 4.50] class: 1
| | | | | | |--- ZIPCode_County_San Diego County > 0.50
| | | | | | | |--- weights: [0.00, 0.90] class: 1
| | | | | |--- Age > 55.00
| | | | | | |--- weights: [0.20, 0.00] class: 0
| | | |--- Income > 110.50
| | | | |--- Income <= 116.50
| | | | | |--- Mortgage <= 126.25
| | | | | | |--- Age <= 60.50
| | | | | | | |--- CCAvg <= 1.20
| | | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | | |--- CCAvg > 1.20
| | | | | | | | |--- ZIPCode_County_Santa Clara County <= 0.50
| | | | | | | | | |--- CCAvg <= 2.65
| | | | | | | | | | |--- Income <= 113.50
| | | | | | | | | | | |--- weights: [0.00, 1.80] class: 1
| | | | | | | | | | |--- Income > 113.50
| | | | | | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | | | | |--- CCAvg > 2.65
| | | | | | | | | | |--- weights: [0.00, 4.50] class: 1
| | | | | | | | |--- ZIPCode_County_Santa Clara County > 0.50
| | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | |--- Age > 60.50
| | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | |--- Mortgage > 126.25
| | | | | | |--- weights: [0.40, 0.00] class: 0
| | | | |--- Income > 116.50
| | | | | |--- weights: [0.00, 97.20] class: 1
| |--- Education_Advanced/Professional > 0.50
| | |--- Income <= 116.50
| | | |--- CCAvg <= 2.35
| | | | |--- Mortgage <= 236.00
| | | | | |--- weights: [4.00, 0.00] class: 0
| | | | |--- Mortgage > 236.00
| | | | | |--- CCAvg <= 1.25
| | | | | | |--- Experience <= 14.00
| | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | |--- Experience > 14.00
| | | | | | | |--- weights: [0.00, 1.80] class: 1
| | | | | |--- CCAvg > 1.25
| | | | | | |--- ZIPCode_County_Los Angeles County <= 0.50
| | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | |--- ZIPCode_County_Los Angeles County > 0.50
| | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | |--- CCAvg > 2.35
| | | | |--- Age <= 64.00
| | | | | |--- ZIPCode_County_Santa Barbara County <= 0.50
| | | | | | |--- CCAvg <= 2.95
| | | | | | | |--- ZIPCode_County_San Diego County <= 0.50
| | | | | | | | |--- CreditCard_1 <= 0.50
| | | | | | | | | |--- weights: [0.50, 0.00] class: 0
| | | | | | | | |--- CreditCard_1 > 0.50
| | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | | |--- ZIPCode_County_San Diego County > 0.50
| | | | | | | | |--- weights: [0.00, 1.80] class: 1
| | | | | | |--- CCAvg > 2.95
| | | | | | | |--- ZIPCode_County_San Diego County <= 0.50
| | | | | | | | |--- ZIPCode_County_San Bernardino County <= 0.50
| | | | | | | | | |--- Mortgage <= 172.00
| | | | | | | | | | |--- CD_Account_1 <= 0.50
| | | | | | | | | | | |--- weights: [0.00, 12.60] class: 1
| | | | | | | | | | |--- CD_Account_1 > 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- Mortgage > 172.00
| | | | | | | | | | |--- Mortgage <= 199.00
| | | | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | | | | | |--- Mortgage > 199.00
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | |--- ZIPCode_County_San Bernardino County > 0.50
| | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | | |--- ZIPCode_County_San Diego County > 0.50
| | | | | | | | |--- Family_2 <= 0.50
| | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | | | |--- Family_2 > 0.50
| | | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | |--- ZIPCode_County_Santa Barbara County > 0.50
| | | | | | |--- Online_1 <= 0.50
| | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | |--- Online_1 > 0.50
| | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | |--- Age > 64.00
| | | | | |--- ZIPCode_County_Yolo County <= 0.50
| | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | |--- ZIPCode_County_Yolo County > 0.50
| | | | | | |--- weights: [0.10, 0.00] class: 0
| | |--- Income > 116.50
| | | |--- weights: [0.00, 102.60] class: 1
importances = model_dt.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1,class_weight = {0:.10,1:.90})
# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(1,10),
    'criterion': ['entropy','gini'],
    'splitter': ['best','random'],
    'min_impurity_decrease': [0.000001,0.00001,0.0001],
    'max_features': ['log2','sqrt']
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.1, 1: 0.9}, max_depth=8,
max_features='log2', min_impurity_decrease=0.0001,
random_state=1)
make_confusion_matrix_dt(estimator,y_test)
get_recall_score(estimator)
Recall on training set : 0.8731117824773413
Recall on test set : 0.738255033557047
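For a sense of how much work the grid search above does, the sketch below enumerates the same parameter grid with `itertools.product` and counts the model fits implied by 5-fold cross-validation. It is a plain-Python illustration and does not call scikit-learn. The variable names are assumptions, and `range(1, 10)` stands in for `np.arange(1,10)`.

```python
from itertools import product

# Same grid as the `parameters` dictionary passed to GridSearchCV above.
param_grid = {
    "max_depth": list(range(1, 10)),
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
    "max_features": ["log2", "sqrt"],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate model.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
print(combinations[0])
```

Each candidate is scored by its mean cross-validated recall, and with the default `refit=True` GridSearchCV then refits the best candidate on the full training set, so `best_estimator_` is already fitted when it is retrieved.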
plt.figure(figsize=(15,10))
out = tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator,feature_names=feature_names,show_weights=True))
|--- ZIPCode_County_Butte County <= 0.50
| |--- CD_Account_1 <= 0.50
| | |--- Mortgage <= 248.50
| | | |--- Mortgage <= 76.50
| | | | |--- CCAvg <= 2.85
| | | | | |--- ZIPCode_County_Ventura County <= 0.50
| | | | | | |--- ZIPCode_County_San Diego County <= 0.50
| | | | | | | |--- ZIPCode_County_Santa Clara County <= 0.50
| | | | | | | | |--- weights: [138.60, 23.40] class: 0
| | | | | | | |--- ZIPCode_County_Santa Clara County > 0.50
| | | | | | | | |--- weights: [21.20, 7.20] class: 0
| | | | | | |--- ZIPCode_County_San Diego County > 0.50
| | | | | | | |--- Income <= 106.00
| | | | | | | | |--- weights: [18.20, 0.00] class: 0
| | | | | | | |--- Income > 106.00
| | | | | | | | |--- weights: [1.70, 5.40] class: 1
| | | | | |--- ZIPCode_County_Ventura County > 0.50
| | | | | | |--- weights: [3.80, 0.00] class: 0
| | | | |--- CCAvg > 2.85
| | | | | |--- Family_4 <= 0.50
| | | | | | |--- ZIPCode_County_Shasta County <= 0.50
| | | | | | | |--- Securities_Account_1 <= 0.50
| | | | | | | | |--- weights: [23.10, 74.70] class: 1
| | | | | | | |--- Securities_Account_1 > 0.50
| | | | | | | | |--- weights: [2.60, 0.90] class: 0
| | | | | | |--- ZIPCode_County_Shasta County > 0.50
| | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | |--- Family_4 > 0.50
| | | | | | |--- Online_1 <= 0.50
| | | | | | | |--- weights: [1.30, 16.20] class: 1
| | | | | | |--- Online_1 > 0.50
| | | | | | | |--- Age <= 46.50
| | | | | | | | |--- weights: [0.80, 9.90] class: 1
| | | | | | | |--- Age > 46.50
| | | | | | | | |--- weights: [1.40, 3.60] class: 1
| | | |--- Mortgage > 76.50
| | | | |--- Family_2 <= 0.50
| | | | | |--- ZIPCode_County_San Mateo County <= 0.50
| | | | | | |--- CCAvg <= 2.75
| | | | | | | |--- ZIPCode_County_Orange County <= 0.50
| | | | | | | | |--- weights: [45.30, 5.40] class: 0
| | | | | | | |--- ZIPCode_County_Orange County > 0.50
| | | | | | | | |--- weights: [4.00, 0.00] class: 0
| | | | | | |--- CCAvg > 2.75
| | | | | | | |--- Securities_Account_1 <= 0.50
| | | | | | | | |--- weights: [3.40, 18.00] class: 1
| | | | | | | |--- Securities_Account_1 > 0.50
| | | | | | | | |--- weights: [0.60, 0.90] class: 1
| | | | | |--- ZIPCode_County_San Mateo County > 0.50
| | | | | | |--- weights: [3.30, 0.00] class: 0
| | | | |--- Family_2 > 0.50
| | | | | |--- Income <= 125.50
| | | | | | |--- weights: [19.20, 0.00] class: 0
| | | | | |--- Income > 125.50
| | | | | | |--- Securities_Account_1 <= 0.50
| | | | | | | |--- ZIPCode_County_Sacramento County <= 0.50
| | | | | | | | |--- weights: [1.60, 3.60] class: 1
| | | | | | | |--- ZIPCode_County_Sacramento County > 0.50
| | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | |--- Securities_Account_1 > 0.50
| | | | | | | |--- weights: [0.20, 0.00] class: 0
| | |--- Mortgage > 248.50
| | | |--- Income <= 98.50
| | | | |--- weights: [5.90, 0.00] class: 0
| | | |--- Income > 98.50
| | | | |--- Family_3 <= 0.50
| | | | | |--- CreditCard_1 <= 0.50
| | | | | | |--- Family_4 <= 0.50
| | | | | | | |--- Securities_Account_1 <= 0.50
| | | | | | | | |--- weights: [4.60, 13.50] class: 1
| | | | | | | |--- Securities_Account_1 > 0.50
| | | | | | | | |--- weights: [0.50, 0.00] class: 0
| | | | | | |--- Family_4 > 0.50
| | | | | | | |--- weights: [0.20, 6.30] class: 1
| | | | | |--- CreditCard_1 > 0.50
| | | | | | |--- Online_1 <= 0.50
| | | | | | | |--- Income <= 169.00
| | | | | | | | |--- weights: [0.50, 1.80] class: 1
| | | | | | | |--- Income > 169.00
| | | | | | | | |--- weights: [0.20, 0.00] class: 0
| | | | | | |--- Online_1 > 0.50
| | | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | |--- Family_3 > 0.50
| | | | | |--- CCAvg <= 2.55
| | | | | | |--- ZIPCode_County_Los Angeles County <= 0.50
| | | | | | | |--- ZIPCode_County_Orange County <= 0.50
| | | | | | | | |--- weights: [0.40, 0.90] class: 1
| | | | | | | |--- ZIPCode_County_Orange County > 0.50
| | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | | |--- ZIPCode_County_Los Angeles County > 0.50
| | | | | | | |--- weights: [0.00, 0.90] class: 1
| | | | | |--- CCAvg > 2.55
| | | | | | |--- weights: [0.10, 12.60] class: 1
| |--- CD_Account_1 > 0.50
| | |--- Securities_Account_1 <= 0.50
| | | |--- ZIPCode_County_Yolo County <= 0.50
| | | | |--- weights: [4.10, 60.30] class: 1
| | | |--- ZIPCode_County_Yolo County > 0.50
| | | | |--- weights: [0.30, 0.00] class: 0
| | |--- Securities_Account_1 > 0.50
| | | |--- CreditCard_1 <= 0.50
| | | | |--- Income <= 81.00
| | | | | |--- weights: [0.90, 0.00] class: 0
| | | | |--- Income > 81.00
| | | | | |--- weights: [0.20, 20.70] class: 1
| | | |--- CreditCard_1 > 0.50
| | | | |--- Experience <= 19.50
| | | | | |--- Income <= 100.50
| | | | | | |--- weights: [2.10, 0.00] class: 0
| | | | | |--- Income > 100.50
| | | | | | |--- ZIPCode_County_San Francisco County <= 0.50
| | | | | | | |--- weights: [0.20, 7.20] class: 1
| | | | | | |--- ZIPCode_County_San Francisco County > 0.50
| | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | |--- Experience > 19.50
| | | | | |--- ZIPCode_County_Santa Clara County <= 0.50
| | | | | | |--- ZIPCode_County_Los Angeles County <= 0.50
| | | | | | | |--- Education_Advanced/Professional <= 0.50
| | | | | | | | |--- weights: [1.90, 0.90] class: 0
| | | | | | | |--- Education_Advanced/Professional > 0.50
| | | | | | | | |--- weights: [0.60, 0.90] class: 1
| | | | | | |--- ZIPCode_County_Los Angeles County > 0.50
| | | | | | | |--- Family_2 <= 0.50
| | | | | | | | |--- weights: [0.40, 2.70] class: 1
| | | | | | | |--- Family_2 > 0.50
| | | | | | | | |--- weights: [0.10, 0.00] class: 0
| | | | | |--- ZIPCode_County_Santa Clara County > 0.50
| | | | | | |--- weights: [0.70, 0.00] class: 0
|--- ZIPCode_County_Butte County > 0.50
| |--- weights: [1.30, 0.00] class: 0
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
clf = DecisionTreeClassifier(random_state=1,class_weight = {0:0.10,1:0.90})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | -1.343626e-15 |
| 1 | 1.805828e-19 | -1.343446e-15 |
| 2 | 7.223312e-19 | -1.342723e-15 |
| 3 | 7.584477e-19 | -1.341965e-15 |
| 4 | 1.011264e-18 | -1.340954e-15 |
| 5 | 1.300196e-18 | -1.339653e-15 |
| 6 | 2.925441e-18 | -1.336728e-15 |
| 7 | 8.704091e-18 | -1.328024e-15 |
| 8 | 9.426422e-18 | -1.318598e-15 |
| 9 | 1.300196e-17 | -1.305596e-15 |
| 10 | 2.275343e-17 | -1.282842e-15 |
| 11 | 2.419448e-16 | -1.040897e-15 |
| 12 | 5.144714e-15 | 4.103816e-15 |
| 13 | 1.614585e-04 | 3.229171e-04 |
| 14 | 1.617559e-04 | 6.464288e-04 |
| 15 | 2.117553e-04 | 1.281695e-03 |
| 16 | 2.927781e-04 | 1.574473e-03 |
| 17 | 3.002853e-04 | 2.475329e-03 |
| 18 | 3.054626e-04 | 3.391717e-03 |
| 19 | 3.081875e-04 | 3.699904e-03 |
| 20 | 3.105223e-04 | 4.631471e-03 |
| 21 | 3.116981e-04 | 4.943169e-03 |
| 22 | 3.136909e-04 | 5.256860e-03 |
| 23 | 3.199567e-04 | 5.576816e-03 |
| 24 | 3.222401e-04 | 6.543537e-03 |
| 25 | 5.753795e-04 | 7.118916e-03 |
| 26 | 6.080777e-04 | 7.726994e-03 |
| 27 | 6.122641e-04 | 8.339258e-03 |
| 28 | 6.273817e-04 | 8.966640e-03 |
| 29 | 7.080002e-04 | 1.109064e-02 |
| 30 | 7.811729e-04 | 1.343416e-02 |
| 31 | 7.840069e-04 | 1.578618e-02 |
| 32 | 8.273599e-04 | 1.661354e-02 |
| 33 | 9.383915e-04 | 1.755193e-02 |
| 34 | 9.647609e-04 | 1.851669e-02 |
| 35 | 1.058707e-03 | 1.957540e-02 |
| 36 | 1.556389e-03 | 2.113179e-02 |
| 37 | 1.682633e-03 | 2.281442e-02 |
| 38 | 2.208456e-03 | 2.723133e-02 |
| 39 | 2.328917e-03 | 2.956025e-02 |
| 40 | 2.909596e-03 | 3.246985e-02 |
| 41 | 3.240232e-03 | 3.571008e-02 |
| 42 | 3.393805e-03 | 3.910388e-02 |
| 43 | 3.470671e-03 | 4.604523e-02 |
| 44 | 3.841577e-03 | 4.988680e-02 |
| 45 | 4.980603e-03 | 5.984801e-02 |
| 46 | 5.881704e-03 | 6.572971e-02 |
| 47 | 5.974141e-03 | 7.170385e-02 |
| 48 | 2.132036e-02 | 9.302421e-02 |
| 49 | 2.840493e-02 | 2.066439e-01 |
| 50 | 2.928785e-01 | 4.995225e-01 |
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.10, 1: 0.90})
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2928785401980033
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
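The last alpha on the pruning path collapses the tree to its root, which is why that entry is dropped above before plotting. A minimal sketch of the same behaviour on synthetic data follows. The data from `make_classification` and the names `X_demo`, `y_demo`, `demo_alphas`, and `demo_node_counts` are assumptions for illustration only and are not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (illustrative assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)

# Effective alphas at which successive subtrees are pruned away.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_demo, y_demo)
# Guard against tiny negative values caused by floating-point round-off.
demo_alphas = np.clip(path.ccp_alphas, 0, None)

# Refit one tree per alpha and record how many nodes survive pruning.
demo_node_counts = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_demo, y_demo).tree_.node_count
    for a in demo_alphas
]

print("nodes in the unpruned tree:", demo_node_counts[0])
print("nodes in the tree for the largest alpha:", demo_node_counts[-1])
```

The largest alpha is not a useful candidate because the resulting one-node tree predicts a single class for every customer.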
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
# Select the pruned tree with the highest recall on the test set
# (np.argmax returns the first alpha that reaches the maximum)
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0049806032349823705,
class_weight={0: 0.1, 1: 0.9}, random_state=1)
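The selection step above picks `ccp_alpha` by looking at recall on the test set, so the test recall reported afterwards is no longer an untouched estimate. One possible alternative, sketched here under assumptions, is to choose alpha with cross-validation on the training split only. The data is synthetic (`make_classification`), and the names `X_demo`, `candidate_alphas`, `cv_recall`, and `cv_model` are hypothetical and not taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (illustrative assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

class_weights = {0: 0.10, 1: 0.90}  # same weighting as used in this notebook

# Candidate alphas from the pruning path; the last one (root-only tree) is dropped.
path = DecisionTreeClassifier(
    random_state=1, class_weight=class_weights
).cost_complexity_pruning_path(X_tr, y_tr)
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# Mean 5-fold recall on the training split for each candidate alpha.
cv_recall = [
    cross_val_score(
        DecisionTreeClassifier(random_state=1, ccp_alpha=a, class_weight=class_weights),
        X_tr, y_tr, cv=5, scoring="recall",
    ).mean()
    for a in candidate_alphas
]

# Refit on the full training split with the chosen alpha, then score once on the test split.
best_alpha = candidate_alphas[int(np.argmax(cv_recall))]
cv_model = DecisionTreeClassifier(
    random_state=1, ccp_alpha=best_alpha, class_weight=class_weights
).fit(X_tr, y_tr)
print("chosen alpha:", best_alpha)
print("test recall:", recall_score(y_te, cv_model.predict(X_te)))
```

With this arrangement the test split is used once, after the alpha has already been fixed.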
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0049806032349823705,
class_weight={0: 0.1, 1: 0.9}, random_state=1)
make_confusion_matrix_dt(best_model,y_test)
get_recall_score(best_model)
Recall on training set : 0.9879154078549849
Recall on test set : 0.9798657718120806
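Recall is the share of actual loan takers that the model catches, TP / (TP + FN). The sketch below reproduces the two printed values from counts. The counts themselves are an inference, not figures stated in the notebook: 327 of 331 for training and 146 of 149 for testing are consistent with the printed recalls, with the class-1 leaf weights summing to 297.9 (0.9 × 331), and with the three false negatives listed by `df_fn`. The helper name `recall_from_counts` is hypothetical.

```python
def recall_from_counts(true_positives, false_negatives):
    """Recall = TP / (TP + FN), the fraction of actual positives that were predicted positive."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_positives / actual_positives


# Counts inferred from the printed recall values (assumption, see the note above).
train_recall = recall_from_counts(true_positives=327, false_negatives=4)
test_recall = recall_from_counts(true_positives=146, false_negatives=3)

print(f"train: {train_recall:.4f}, test: {test_recall:.4f}")
# prints "train: 0.9879, test: 0.9799"
```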
plt.figure(figsize=(15,10))
out = tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model,feature_names=feature_names,show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [243.50, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- weights: [11.70, 13.50] class: 1
|--- Income > 92.50
|   |--- Education_Advanced/Professional <= 0.50
|   |   |--- Education_Graduate <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [47.80, 2.70] class: 0
|   |   |   |   |--- Family_4 > 0.50
|   |   |   |   |   |--- weights: [0.20, 18.00] class: 1
|   |   |   |--- Family_3 > 0.50
|   |   |   |   |--- weights: [1.40, 31.50] class: 1
|   |   |--- Education_Graduate > 0.50
|   |   |   |--- Income <= 110.50
|   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |--- weights: [4.70, 0.90] class: 0
|   |   |   |   |--- CCAvg > 2.90
|   |   |   |   |   |--- weights: [0.20, 5.40] class: 1
|   |   |   |--- Income > 110.50
|   |   |   |   |--- weights: [1.20, 103.50] class: 1
|   |--- Education_Advanced/Professional > 0.50
|   |   |--- weights: [6.20, 122.40] class: 1
#Calculate the important features of the best_model
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Attach the actual and predicted labels to a copy of the test features
df_with_predicted = X_test.copy()
df_with_predicted['ActualLabel'] = y_test.copy()
y_predicted = best_model.predict(X_test)
df_with_predicted['PredictedLabel'] = y_predicted.copy()
# False positives: predicted to take the loan (1) but did not (0)
df_fp = df_with_predicted[(df_with_predicted['PredictedLabel'] == 1) &
                          (df_with_predicted['ActualLabel'] == 0)]
print(df_fp)
# False negatives: predicted not to take the loan (0) but did (1)
df_fn = df_with_predicted[(df_with_predicted['PredictedLabel'] == 0) &
                          (df_with_predicted['ActualLabel'] == 1)]
print(df_fn)
const Age Experience Income CCAvg Mortgage Family_2 Family_3 \
1179 1.0 36 11.0 98.0 1.2 0.0 0 1
932 1.0 51 27.0 112.0 1.8 0.0 0 1
792 1.0 41 16.0 98.0 4.0 0.0 0 0
2982 1.0 59 33.0 111.0 4.4 0.0 0 1
3420 1.0 66 41.0 114.0 0.8 0.0 0 0
3144 1.0 43 18.0 104.0 1.0 0.0 0 1
4868 1.0 51 27.0 62.0 3.2 118.0 1 0
3741 1.0 53 29.0 51.0 3.2 0.0 1 0
3501 1.0 65 39.0 105.0 1.7 0.0 0 0
3990 1.0 57 32.0 59.0 3.7 134.0 1 0
169 1.0 27 1.0 112.0 2.1 0.0 0 0
3067 1.0 31 5.0 101.0 2.9 170.0 0 0
4065 1.0 44 19.0 68.0 3.7 0.0 0 0
3277 1.0 43 19.0 81.0 3.2 0.0 1 0
4442 1.0 48 23.0 62.0 3.6 83.0 0 0
729 1.0 58 28.0 90.0 3.0 0.0 0 0
3630 1.0 41 16.0 79.0 4.0 225.0 0 0
1941 1.0 43 19.0 58.0 3.2 0.0 1 0
836 1.0 42 17.0 74.0 3.0 0.0 0 1
2762 1.0 56 31.0 65.0 3.7 0.0 1 0
1074 1.0 39 14.0 75.0 3.0 0.0 0 1
3792 1.0 62 36.0 109.0 1.7 0.0 0 0
3708 1.0 31 1.0 74.0 4.0 0.0 0 0
12 1.0 48 23.0 114.0 3.8 0.0 1 0
717 1.0 59 34.0 94.0 0.5 0.0 0 1
222 1.0 26 2.0 104.0 2.5 0.0 0 1
3735 1.0 40 14.0 78.0 5.2 0.0 0 0
394 1.0 33 9.0 80.0 3.4 0.0 0 0
1897 1.0 54 29.0 98.0 0.1 0.0 0 0
3409 1.0 29 5.0 113.0 2.0 84.0 1 0
4392 1.0 52 27.0 81.0 3.8 0.0 0 0
4554 1.0 41 16.0 109.0 1.0 0.0 0 1
2351 1.0 55 31.0 74.0 3.2 0.0 1 0
2685 1.0 28 2.0 101.0 2.1 0.0 0 0
1875 1.0 27 3.0 112.0 2.5 252.5 0 1
3023 1.0 63 37.0 105.0 1.7 244.0 0 0
2016 1.0 41 17.0 93.0 0.8 218.0 0 0
3274 1.0 31 5.0 110.0 1.5 0.0 1 0
4327 1.0 30 4.0 102.0 2.1 139.0 0 0
2070 1.0 62 37.0 95.0 0.5 0.0 0 1
4904 1.0 64 40.0 88.0 3.8 243.0 0 0
82 1.0 41 16.0 82.0 4.0 0.0 0 0
1565 1.0 34 9.0 104.0 1.2 0.0 0 1
486 1.0 55 30.0 84.0 3.7 252.5 1 0
4229 1.0 54 24.0 83.0 3.0 0.0 0 0
4571 1.0 58 28.0 95.0 3.0 0.0 0 0
256 1.0 26 0.0 99.0 2.3 0.0 0 0
1271 1.0 28 4.0 94.0 0.8 236.0 0 1
1523 1.0 41 16.0 104.0 1.0 0.0 0 0
2854 1.0 49 24.0 79.0 3.6 212.0 0 0
1147 1.0 37 13.0 111.0 0.8 0.0 0 0
3730 1.0 30 6.0 112.0 2.5 0.0 0 1
1045 1.0 43 18.0 84.0 4.0 0.0 0 0
3818 1.0 26 0.0 102.0 2.3 0.0 0 0
420 1.0 47 22.0 58.0 3.6 0.0 0 0
3685 1.0 53 27.0 93.0 0.8 252.5 0 0
3645 1.0 42 17.0 79.0 3.7 0.0 0 0
1030 1.0 61 35.0 112.0 1.7 0.0 0 0
3402 1.0 64 40.0 95.0 0.0 0.0 1 0
3042 1.0 52 26.0 78.0 3.0 0.0 0 1
1832 1.0 54 29.0 79.0 3.8 0.0 0 0
123 1.0 37 13.0 84.0 3.6 0.0 0 0
3468 1.0 43 19.0 113.0 1.8 0.0 1 0
2900 1.0 52 28.0 55.0 3.2 151.0 1 0
4762 1.0 37 7.0 94.0 1.8 232.0 0 0
560 1.0 43 18.0 59.0 3.7 0.0 0 0
2630 1.0 63 37.0 113.0 1.7 0.0 0 0
2665 1.0 35 9.0 105.0 4.5 0.0 1 0
3650 1.0 47 21.0 93.0 0.8 107.0 1 0
3064 1.0 59 33.0 83.0 4.4 0.0 0 1
3308 1.0 48 23.0 108.0 3.8 0.0 1 0
4925 1.0 64 39.0 82.0 3.4 0.0 0 0
290 1.0 51 25.0 80.0 4.9 0.0 0 0
990 1.0 34 10.0 81.0 3.4 0.0 0 0
4816 1.0 50 24.0 83.0 3.0 0.0 0 1
3186 1.0 41 16.0 98.0 1.0 252.5 0 1
4610 1.0 37 13.0 79.0 3.6 104.0 0 0
2738 1.0 35 9.0 103.0 4.5 0.0 1 0
3683 1.0 53 27.0 62.0 3.0 0.0 0 1
2470 1.0 33 7.0 81.0 4.5 187.0 1 0
2563 1.0 39 13.0 94.0 1.5 0.0 0 0
1837 1.0 43 18.0 103.0 1.0 180.0 0 1
719 1.0 61 35.0 110.0 4.4 0.0 0 1
2625 1.0 61 36.0 108.0 3.4 0.0 0 0
3544 1.0 45 19.0 109.0 1.1 0.0 0 1
4290 1.0 66 42.0 95.0 0.0 0.0 1 0
4665 1.0 40 16.0 65.0 3.2 0.0 1 0
1401 1.0 40 15.0 84.0 3.7 0.0 0 0
829 1.0 55 30.0 81.0 3.8 0.0 0 0
3505 1.0 64 39.0 103.0 0.8 0.0 0 0
Family_4 Education_Graduate Education_Advanced/Professional \
1179 0 0 1
932 0 1 0
792 0 0 1
2982 0 0 0
3420 0 0 1
3144 0 0 0
4868 0 0 1
3741 0 0 1
3501 1 0 1
3990 0 0 0
169 1 0 1
3067 0 0 1
4065 0 0 1
3277 0 0 0
4442 1 0 1
729 0 0 1
3630 0 0 1
1941 0 0 0
836 0 0 0
2762 0 0 0
1074 0 0 0
3792 1 0 1
3708 1 0 1
12 0 0 1
717 0 0 0
222 0 0 0
3735 0 0 0
394 1 0 0
1897 0 0 1
3409 0 1 0
4392 1 1 0
4554 0 0 0
2351 0 0 1
2685 1 0 1
1875 0 0 0
3023 1 0 1
2016 1 0 0
3274 0 0 1
4327 1 0 1
2070 0 0 0
4904 0 0 0
82 0 0 1
1565 0 0 1
486 0 0 0
4229 0 0 1
4571 0 0 1
256 1 0 1
1271 0 0 0
1523 0 0 1
2854 1 0 1
1147 0 1 0
3730 0 0 0
1045 0 0 1
3818 1 0 1
420 1 0 1
3685 0 0 1
3645 0 0 1
1030 1 0 1
3402 0 0 1
3042 0 1 0
1832 1 1 0
123 0 1 0
3468 0 1 0
2900 0 0 1
4762 1 0 1
560 0 0 1
2630 1 0 1
2665 0 0 1
3650 0 0 1
3064 0 0 0
3308 0 0 1
4925 1 1 0
290 0 0 0
990 1 0 0
4816 0 1 0
3186 0 0 0
4610 0 1 0
2738 0 0 1
3683 0 1 0
2470 0 0 1
2563 0 0 1
1837 0 0 0
719 0 0 0
2625 1 1 0
3544 0 0 0
4290 0 0 1
4665 0 0 0
1401 0 0 1
829 1 1 0
3505 0 0 1
Securities_Account_1 CD_Account_1 Online_1 CreditCard_1 \
1179 1 0 0 1
932 1 1 1 1
792 0 0 0 1
2982 0 0 1 0
3420 0 0 1 1
3144 0 0 1 0
4868 0 0 0 1
3741 0 0 1 0
3501 1 0 1 0
3990 0 0 1 0
169 0 0 0 1
3067 1 0 0 0
4065 0 0 1 0
3277 0 0 1 0
4442 0 0 0 1
729 0 0 0 1
3630 0 0 1 0
1941 0 0 1 0
836 0 0 0 1
2762 0 0 1 0
1074 0 0 0 1
3792 0 0 1 0
3708 0 0 0 0
12 1 0 0 0
717 0 0 0 1
222 0 0 0 0
3735 0 0 1 0
394 0 0 1 1
1897 0 0 0 0
3409 0 0 1 1
4392 0 0 0 0
4554 1 0 1 0
2351 0 0 1 1
2685 0 0 1 0
1875 1 0 1 0
3023 0 0 0 1
2016 0 0 0 0
3274 0 0 1 0
4327 0 0 0 1
2070 0 0 0 0
4904 0 0 1 1
82 0 0 1 0
1565 0 0 1 0
486 1 0 1 0
4229 0 0 0 0
4571 0 0 0 0
256 0 0 0 1
1271 0 0 1 0
1523 0 0 1 0
2854 0 0 1 0
1147 0 0 0 0
3730 0 0 1 0
1045 0 0 0 0
3818 0 0 0 0
420 0 0 1 1
3685 0 0 0 0
3645 1 0 0 1
1030 0 0 0 1
3402 0 0 1 1
3042 0 0 0 0
1832 1 0 1 0
123 1 0 0 0
3468 0 0 0 1
2900 0 0 0 0
4762 0 0 1 0
560 0 0 1 0
2630 0 0 1 1
2665 0 0 0 0
3650 0 0 0 0
3064 0 0 1 0
3308 0 0 0 1
4925 0 0 1 0
290 0 0 0 0
990 0 0 1 0
4816 0 0 0 1
3186 0 0 0 0
4610 0 0 1 0
2738 0 0 1 0
3683 1 0 0 0
2470 0 1 1 1
2563 0 0 0 1
1837 0 0 1 1
719 1 0 1 0
2625 0 0 1 0
3544 0 0 0 0
4290 0 0 1 0
4665 0 0 1 0
1401 0 0 1 0
829 0 0 1 0
3505 0 0 1 1
ZIPCode_County_Butte County ZIPCode_County_Contra Costa County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 1
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 1
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 1
4665 0 0
1401 0 1
829 0 0
3505 0 0
ZIPCode_County_El Dorado County ZIPCode_County_Fresno County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 1 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Humboldt County ZIPCode_County_Imperial County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Kern County ZIPCode_County_Lake County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Los Angeles County ZIPCode_County_Marin County \
1179 1 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 1 0
4868 0 0
3741 0 0
3501 1 0
3990 0 0
169 1 0
3067 0 0
4065 0 0
3277 0 0
4442 1 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 1 0
394 1 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 1 0
1875 1 0
3023 0 0
2016 0 0
3274 0 0
4327 1 0
2070 1 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 1 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 1 0
3402 1 0
3042 0 0
1832 1 0
123 0 0
3468 0 0
2900 0 0
4762 1 0
560 0 0
2630 0 0
2665 1 0
3650 1 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 1 0
2738 0 0
3683 0 0
2470 0 0
2563 1 0
1837 1 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 1 0
1401 0 0
829 1 0
3505 1 0
ZIPCode_County_Mendocino County ZIPCode_County_Merced County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Monterey County ZIPCode_County_Napa County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 1 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 1 0
2900 0 0
4762 0 0
560 1 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 1 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Orange County ZIPCode_County_Placer County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 1 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 1 0
2016 1 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 1 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Riverside County ZIPCode_County_Sacramento County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 1 0
1565 0 1
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 1 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_San Benito County ZIPCode_County_San Bernardino County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 1
990 0 0
4816 0 1
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_San Diego County ZIPCode_County_San Francisco County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 1 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 1 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 1 0
2762 0 0
1074 0 0
3792 0 0
3708 1 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 1 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 1 0
4229 0 0
4571 0 0
256 0 0
1271 1 0
1523 1 0
2854 0 0
1147 0 0
3730 1 0
1045 1 0
3818 0 0
420 0 0
3685 0 0
3645 1 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 1 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 1 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 1 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 1 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_San Joaquin County \
1179 0
932 0
792 0
2982 0
3420 0
3144 0
4868 0
3741 0
3501 0
3990 0
169 0
3067 0
4065 0
3277 0
4442 0
729 0
3630 0
1941 0
836 0
2762 0
1074 0
3792 0
3708 0
12 0
717 0
222 0
3735 0
394 0
1897 0
3409 0
4392 0
4554 0
2351 0
2685 0
1875 0
3023 0
2016 0
3274 0
4327 0
2070 0
4904 0
82 0
1565 0
486 0
4229 0
4571 0
256 0
1271 0
1523 0
2854 0
1147 0
3730 0
1045 0
3818 0
420 0
3685 0
3645 0
1030 0
3402 0
3042 0
1832 0
123 0
3468 0
2900 0
4762 0
560 0
2630 0
2665 0
3650 0
3064 0
3308 0
4925 0
290 0
990 0
4816 0
3186 0
4610 0
2738 0
3683 0
2470 0
2563 0
1837 0
719 0
2625 0
3544 0
4290 0
4665 0
1401 0
829 0
3505 0
ZIPCode_County_San Luis Obispo County ZIPCode_County_San Mateo County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 1
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 1 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 1
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 1
290 0 0
990 0 1
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Santa Barbara County ZIPCode_County_Santa Clara County \
1179 0 0
932 0 0
792 1 0
2982 0 1
3420 0 1
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 1
169 0 0
3067 0 0
4065 0 1
3277 0 0
4442 0 0
729 1 0
3630 0 1
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 1 0
717 0 0
222 0 1
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 1
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 1
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 1
420 1 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 1
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 1
4610 0 0
2738 0 1
3683 0 1
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Santa Cruz County ZIPCode_County_Shasta County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Siskiyou County ZIPCode_County_Solano County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Sonoma County ZIPCode_County_Stanislaus County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 1
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 1
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Trinity County ZIPCode_County_Tuolumne County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 0
3409 0 0
4392 0 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 0
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Unknown ZIPCode_County_Ventura County \
1179 0 0
932 0 0
792 0 0
2982 0 0
3420 0 0
3144 0 0
4868 0 0
3741 0 0
3501 0 0
3990 0 0
169 0 0
3067 0 0
4065 0 0
3277 0 0
4442 0 0
729 0 0
3630 0 0
1941 0 0
836 0 0
2762 0 0
1074 0 0
3792 0 0
3708 0 0
12 0 0
717 0 0
222 0 0
3735 0 0
394 0 0
1897 0 1
3409 0 0
4392 1 0
4554 0 0
2351 0 0
2685 0 0
1875 0 0
3023 0 0
2016 0 0
3274 0 0
4327 0 0
2070 0 0
4904 0 0
82 0 0
1565 0 0
486 0 0
4229 0 0
4571 0 0
256 0 0
1271 0 0
1523 0 0
2854 0 0
1147 0 0
3730 0 0
1045 0 0
3818 0 0
420 0 0
3685 0 0
3645 0 0
1030 0 0
3402 0 0
3042 0 0
1832 0 0
123 0 0
3468 0 0
2900 0 1
4762 0 0
560 0 0
2630 0 0
2665 0 0
3650 0 0
3064 0 0
3308 0 0
4925 0 0
290 0 0
990 0 0
4816 0 0
3186 0 0
4610 0 0
2738 0 0
3683 0 0
2470 0 0
2563 0 0
1837 0 0
719 0 0
2625 0 0
3544 0 0
4290 0 0
4665 0 0
1401 0 0
829 0 0
3505 0 0
ZIPCode_County_Yolo County ActualLabel PredictedLabel
1179 0 0 1
932 0 0 1
792 0 0 1
2982 0 0 1
3420 0 0 1
3144 0 0 1
4868 0 0 1
3741 0 0 1
3501 0 0 1
3990 0 0 1
169 0 0 1
3067 0 0 1
4065 0 0 1
3277 0 0 1
4442 0 0 1
729 0 0 1
3630 0 0 1
1941 0 0 1
836 0 0 1
2762 0 0 1
1074 0 0 1
3792 0 0 1
3708 0 0 1
12 0 0 1
717 0 0 1
222 0 0 1
3735 0 0 1
394 0 0 1
1897 0 0 1
3409 0 0 1
4392 0 0 1
4554 0 0 1
2351 0 0 1
2685 0 0 1
1875 0 0 1
3023 0 0 1
2016 0 0 1
3274 0 0 1
4327 0 0 1
2070 0 0 1
4904 0 0 1
82 0 0 1
1565 0 0 1
486 0 0 1
4229 0 0 1
4571 0 0 1
256 0 0 1
1271 0 0 1
1523 0 0 1
2854 0 0 1
1147 0 0 1
3730 0 0 1
1045 0 0 1
3818 0 0 1
420 0 0 1
3685 0 0 1
3645 0 0 1
1030 0 0 1
3402 0 0 1
3042 0 0 1
1832 0 0 1
123 0 0 1
3468 0 0 1
2900 0 0 1
4762 0 0 1
560 0 0 1
2630 0 0 1
2665 0 0 1
3650 0 0 1
3064 0 0 1
3308 0 0 1
4925 0 0 1
290 0 0 1
990 0 0 1
4816 0 0 1
3186 0 0 1
4610 0 0 1
2738 0 0 1
3683 0 0 1
2470 0 0 1
2563 0 0 1
1837 0 0 1
719 0 0 1
2625 0 0 1
3544 0 0 1
4290 0 0 1
4665 0 0 1
1401 0 0 1
829 0 0 1
3505 0 0 1
const Age Experience Income CCAvg Mortgage Family_2 Family_3 \
322 1.0 63 39.0 101.0 3.9 0.0 0 0
1126 1.0 32 8.0 104.0 3.7 0.0 1 0
2539 1.0 32 7.0 98.0 4.2 171.0 0 0
Family_4 Education_Graduate Education_Advanced/Professional \
322 0 0 0
1126 0 0 0
2539 0 0 0
Securities_Account_1 CD_Account_1 Online_1 CreditCard_1 \
322 1 1 1 0
1126 0 0 0 1
2539 1 1 0 0
ZIPCode_County_Butte County ZIPCode_County_Contra Costa County \
322 0 0
1126 0 0
2539 1 0
ZIPCode_County_El Dorado County ZIPCode_County_Fresno County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Humboldt County ZIPCode_County_Imperial County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Kern County ZIPCode_County_Lake County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Los Angeles County ZIPCode_County_Marin County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Mendocino County ZIPCode_County_Merced County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Monterey County ZIPCode_County_Napa County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Orange County ZIPCode_County_Placer County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Riverside County ZIPCode_County_Sacramento County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_San Benito County ZIPCode_County_San Bernardino County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_San Diego County ZIPCode_County_San Francisco County \
322 1 0
1126 0 0
2539 0 0
ZIPCode_County_San Joaquin County \
322 0
1126 0
2539 0
ZIPCode_County_San Luis Obispo County ZIPCode_County_San Mateo County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Santa Barbara County ZIPCode_County_Santa Clara County \
322 0 0
1126 0 1
2539 0 0
ZIPCode_County_Santa Cruz County ZIPCode_County_Shasta County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Siskiyou County ZIPCode_County_Solano County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Sonoma County ZIPCode_County_Stanislaus County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Trinity County ZIPCode_County_Tuolumne County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Unknown ZIPCode_County_Ventura County \
322 0 0
1126 0 0
2539 0 0
ZIPCode_County_Yolo County ActualLabel PredictedLabel
322 0 1 0
1126 0 1 0
2539 0 1 0
For the false negatives, the family size of the mispredicted customers is either 1 or 2.
For the false positives, the dataset also contains lower-income customers in the Advanced/Professional education category, and a number of the wrong predictions fall in that group.
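The split into false positives and false negatives can be sketched with a boolean mask on the two label columns. This is a minimal illustration, not the notebook's original cell: the frame name `results` is an assumption, the labels for rows 1179, 932, 322, 1126 and 2539 are copied from the output above, and rows 0 and 1 are hypothetical correctly classified rows added for contrast.

```python
import pandas as pd

# Toy frame. Labels for 1179/932 (false positives) and 322/1126/2539
# (false negatives) come from the printed output; rows 0 and 1 are hypothetical.
results = pd.DataFrame(
    {
        "ActualLabel": [0, 0, 1, 1, 1, 0, 1],
        "PredictedLabel": [1, 1, 0, 0, 0, 0, 1],
    },
    index=[1179, 932, 322, 1126, 2539, 0, 1],
)

# False positive: predicted a loan purchase that did not happen.
false_positives = results[
    (results["ActualLabel"] == 0) & (results["PredictedLabel"] == 1)
]

# False negative: missed a customer who did buy the loan.
false_negatives = results[
    (results["ActualLabel"] == 1) & (results["PredictedLabel"] == 0)
]

print(list(false_positives.index))  # [1179, 932]
print(list(false_negatives.index))  # [322, 1126, 2539]
```

Since recall is the metric being optimized, the false negatives are the rows that matter most here: each one is a potential borrower the campaign would skip.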
best_model2 = DecisionTreeClassifier(ccp_alpha=0.029,
class_weight={0: 0.10, 1: 0.90}, random_state=1)
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.029, class_weight={0: 0.1, 1: 0.9},
random_state=1)
make_confusion_matrix_dt(best_model2,y_test)
get_recall_score(best_model2)
Recall on training set : 0.9546827794561934
Recall on test set : 0.9328859060402684
plt.figure(figsize=(15,10))
out = tree.plot_tree(best_model2,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
# Draw the split arrows in black so they stay visible on the filled nodes
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2,feature_names=feature_names,show_weights=True))
|--- Income <= 92.50
|   |--- weights: [255.20, 13.50] class: 0
|--- Income > 92.50
|   |--- weights: [61.70, 284.40] class: 1
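The leaf weights in this report are class-weighted, so they are not raw sample counts. A rough sketch of how to read them, assuming the weights are training-set counts multiplied by the `class_weight` values given to the classifier (0.10 for class 0, 0.90 for class 1): dividing by the class weight recovers the counts, and the recall derived from them agrees with the training recall printed earlier. The helper name `raw_counts` is hypothetical.

```python
# Class weights passed to DecisionTreeClassifier for best_model2.
class_weight = {0: 0.10, 1: 0.90}

# Weighted leaf counts [class 0, class 1] copied from the export_text output.
leaf_income_low = [255.20, 13.50]    # Income <= 92.50, predicted class 0
leaf_income_high = [61.70, 284.40]   # Income > 92.50, predicted class 1


def raw_counts(weighted):
    """Divide each weighted count by its class weight to recover sample counts."""
    return [round(w / class_weight[c]) for c, w in enumerate(weighted)]


tn, fn = raw_counts(leaf_income_low)    # class-0 leaf: true and false negatives
fp, tp = raw_counts(leaf_income_high)   # class-1 leaf: false and true positives

train_recall = tp / (tp + fn)
train_precision = tp / (tp + fp)

print(tn, fn, fp, tp)                                      # 2552 15 617 316
print(round(train_recall, 4), round(train_precision, 4))   # 0.9547 0.3387
```

Under that assumption the single Income split catches 316 of 331 training buyers, but it also flags 617 non-buyers, so the high recall comes with a training precision of only about 34%.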
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(best_model2.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp
Income  1.0
const  0.0
ZIPCode_County_Santa Barbara County  0.0
ZIPCode_County_Orange County  0.0
ZIPCode_County_Placer County  0.0
ZIPCode_County_Riverside County  0.0
ZIPCode_County_Sacramento County  0.0
ZIPCode_County_San Benito County  0.0
ZIPCode_County_San Bernardino County  0.0
ZIPCode_County_San Diego County  0.0
ZIPCode_County_San Francisco County  0.0
ZIPCode_County_San Joaquin County  0.0
ZIPCode_County_San Luis Obispo County  0.0
ZIPCode_County_San Mateo County  0.0
ZIPCode_County_Santa Clara County  0.0
ZIPCode_County_Monterey County  0.0
ZIPCode_County_Santa Cruz County  0.0
ZIPCode_County_Shasta County  0.0
ZIPCode_County_Siskiyou County  0.0
ZIPCode_County_Solano County  0.0
ZIPCode_County_Sonoma County  0.0
ZIPCode_County_Stanislaus County  0.0
ZIPCode_County_Trinity County  0.0
ZIPCode_County_Tuolumne County  0.0
ZIPCode_County_Unknown  0.0
ZIPCode_County_Ventura County  0.0
ZIPCode_County_Napa County  0.0
ZIPCode_County_Merced County  0.0
Age  0.0
Online_1  0.0
Experience  0.0
CCAvg  0.0
Mortgage  0.0
Family_2  0.0
Family_3  0.0
Family_4  0.0
Education_Graduate  0.0
Education_Advanced/Professional  0.0
Securities_Account_1  0.0
CD_Account_1  0.0
CreditCard_1  0.0
ZIPCode_County_Mendocino County  0.0
ZIPCode_County_Butte County  0.0
ZIPCode_County_Contra Costa County  0.0
ZIPCode_County_El Dorado County  0.0
ZIPCode_County_Fresno County  0.0
ZIPCode_County_Humboldt County  0.0
ZIPCode_County_Imperial County  0.0
ZIPCode_County_Kern County  0.0
ZIPCode_County_Lake County  0.0
ZIPCode_County_Los Angeles County  0.0
ZIPCode_County_Marin County  0.0
ZIPCode_County_Yolo County  0.0
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with hyperparameter tuning',
'Decision tree with post-pruning'], 'Train_Recall':[1,0.87,0.98], 'Test_Recall':[0.85,0.73,0.97]})
comparison_frame
| | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.85 |
| 1 | Decision tree with hyperparameter tuning | 0.87 | 0.73 |
| 2 | Decision tree with post-pruning | 0.98 | 0.97 |
According to the decision tree model:
If a customer's income is 92.5k or less, there is a very high chance that the customer will not buy a loan from the bank.
If a customer's income is greater than 92.5k and their education level is Advanced/Professional (level 3), there is a very high chance that the customer will buy a loan from the bank.
Customers with a family size of 3 or 4 are more likely to buy a loan. The marketing team can target them as potential customers.
Use the predictive model to identify potential customers (those likely to buy the product), and market offers and deals in real time only to those customers.
About 60% of the customers have an online account. Attractive online advertisements with competitive offers and deals can therefore draw more customers to the loan.
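As a rough illustration only, the rules above can be written as a small screening function. The function name `target_priority`, the three priority labels, and the way the education and family-size rules are combined are all hypothetical choices made for this sketch; only the 92.5k income threshold, the Advanced/Professional category, and the family sizes 3 and 4 come from the analysis.

```python
INCOME_SPLIT_K = 92.5  # income threshold (in thousands) reported by the pruned tree


def target_priority(income_k, education, family_size):
    """Return a coarse, hypothetical campaign priority for one customer."""
    # At or below the income split, the tree predicts no loan purchase.
    if income_k <= INCOME_SPLIT_K:
        return "low"
    # Above the split, education level or family size strengthens the case.
    if education == "Advanced/Professional" or family_size in (3, 4):
        return "high"
    return "medium"


print(target_priority(60.0, "Graduate", 4))                 # low
print(target_priority(120.0, "Advanced/Professional", 1))   # high
print(target_priority(120.0, "Graduate", 2))                # medium
```

In practice the fitted model's own predictions would drive the targeting; a hand-written rule like this is mainly useful for explaining the segments to the marketing team.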